We Bet, You’ll Bet!!!

Group 25 (The Pitchers!): Arjun Bajpai, Jalvi Sheta, Pushkar Goyal

Section1: Introduction

About Us:

The three of us are huge cricket fans. We come from India, where cricket is not just a game but a religion. When we learned that we needed to do a project using Python, all of us instantly connected with the topic and were excited by the idea of making this team project a fun one. Two reasons come to mind when we think about the project’s importance. First, how cool would it be if we could actually predict the outcomes of cricket matches and see whether a team is able to win against all odds? Second, this would be helpful for the advertising industry, which relies heavily on game results to get the desired outcome from its marketing campaigns.

About the Project

Four years is long enough for a team to go from zero to hero. This is exactly what happened with the current World Cup champions, England. In the 2015 World Cup, England were eliminated in the group stage; come 2019, they were crowned champions of the world in their own backyard.

  • In this project, we analyse and plot the journey of the teams that participated in the 2019 Cricket World Cup, using the data we collected from the 'espncricinfo' website. We try to understand which team would have been the ideal team to place bets on in the ICC 2019 Cricket World Cup.

We wish to analyse the data and draw conclusions from different types of data visualizations to help us understand how a particular team or player performs under certain conditions. The conditions include the venue of the tournament (that is, whether the team is playing at home or away), the kind of pitches it is playing on, and how many tosses the team wins or loses. We also take into consideration the number of matches a team has won when chasing or defending a total.

Questions of Interest

  • Which team had the best chance of winning the 2019 World Cup (based on the model)?
  • Who were the best bowlers and batsmen?

Basics About Cricket

During the project showcase, we realised that not many of our classmates watch cricket, so we decided to mention some fundamental concepts and rules of cricket that would help them understand the project better.

  • Two teams (11 players each) face each other in a 50-over game of cricket. Each team gets to bat for 50 overs.
  • The side that wins the toss elects to bat or bowl first.
  • The aim of the team batting first is to score as many runs as possible in those 50 overs (that is, 300 balls, since each over has 6 balls) so that the team batting second cannot beat that score.
  • The aim of the bowling team is to restrict the batting team to the lowest possible score. This can be done by taking wickets regularly and maintaining good bowling statistics.
  • Wickets: The batting team has 10 wickets in hand. Its innings ends once 10 of its 11 batsmen are out, even if overs remain.

  • Also refer to 'https://www.youtube.com/watch?v=g-beFHld19c'. This is a link to a short YouTube video that explains the game of Cricket in just 3.5 Minutes.

Section2: Data Acquisition and Cleaning Code

We used web scraping to collect data from stats.espncricinfo.com, which has data on past cricket matches and players. We scraped data for the past six years and saved it into multiple CSV files. We required data for only the ten teams that played the World Cup, but the website had data for all the teams that play cricket. To clean the data, we first removed the data of the other teams. Second, because the World Cup was played in England, we filtered the data for matches played in England. We also merged the data from six different years (2014 to 2019), stored in six different files, into one CSV file because the file structure was the same; this let us work on one file and get the stats for all the years by indexing just that file. Finally, we gathered information about the grounds in England where the 2019 World Cup was played, as well as data on the batsmen and bowlers who played in the World Cup.

The code below shows the scraping process. The tables were scraped from the website and stored directly as CSV files on the local system.

In [1]:
from bs4 import BeautifulSoup
import requests
from numpy import nan as NA
import numpy as np
import pandas as pd

url = 'http://stats.espncricinfo.com/ci/engine/records/team/series_results.html?class=2;id=201;type=decade'
page = requests.get(url)

readPandas = pd.read_html(url, match ='Winner' )[0]
readPandas.to_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ SeriesResults.csv')
In [2]:
url = 'http://stats.espncricinfo.com/ci/engine/records/team/results_summary.html?class=2;id=201;type=decade'
page = requests.get(url)

readPandas = pd.read_html(url, match ='Team' )[0]
readPandas.to_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ MatchResults.csv')
In [3]:
url = 'http://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=2;id=2015;type=year'
page = requests.get(url)

readPandas = pd.read_html(url, match ='Team 1' )[0]
readPandas.to_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ Results2015.csv')
In [4]:
url = 'http://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=2;id=2016;type=year'
page = requests.get(url)

readPandas = pd.read_html(url, match ='Team 1' )[0]
readPandas.to_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ Results2016.csv')
In [5]:
url = 'http://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=2;id=2017;type=year'
page = requests.get(url)

readPandas = pd.read_html(url, match ='Team 1' )[0]
readPandas.to_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ Results2017.csv')
In [6]:
url = 'http://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=2;id=2018;type=year'
page = requests.get(url)

readPandas = pd.read_html(url, match ='Team 1' )[0]
readPandas.to_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ Results2018.csv')
In [7]:
url = 'http://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=2;id=2019;type=year'
page = requests.get(url)

readPandas = pd.read_html(url, match ='Team 1' )[0]
readPandas.to_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ Results2019.csv')
In [8]:
url = 'http://stats.espncricinfo.com/ci/engine/records/team/highest_innings_totals.html?class=2;id=201;type=decade'
page = requests.get(url)

readPandas = pd.read_html(url, match ='Team' )[0]
readPandas.to_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ HighestTotals.csv')
In [9]:
url = 'http://stats.espncricinfo.com/ci/engine/records/team/lowest_innings_totals.html?class=2;id=201;type=decade'
page = requests.get(url)

readPandas = pd.read_html(url, match ='Team' )[0]
readPandas.to_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ LowestTotals.csv')
In [10]:
url = 'http://stats.espncricinfo.com/ci/engine/records/fielding/most_catches_career.html?class=2;id=201;type=decade'
page = requests.get(url)

readPandas = pd.read_html(url, match ='Player' )[0]
readPandas.to_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ MostCatches.csv')
In [11]:
url = 'http://stats.espncricinfo.com/ci/engine/stats/index.html?class=2;spanmax2=12+Dec+2019;spanmin2=12+Dec+2012;spanval2=span;template=results;type=aggregate;view=ground'
page = requests.get(url)

readPandas = pd.read_html(url, match ='Mat' )[0]
readPandas.to_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ GroundNumbers.csv')
In [12]:
url = 'http://stats.espncricinfo.com/ci/content/records/283878.html'
page = requests.get(url)

readPandas = pd.read_html(url, match ='Mat' )[0]
readPandas.to_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/Match_Results.csv')
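
The yearly results cells above differ only in the year inside the URL, so they could be folded into one loop. The sketch below is one way to do that, not the code used for the project: `yearly_results_url`, `scrape_yearly_results` and the `out_dir` argument are names made up for illustration, the output files are named without the leading space used in the paths above, and the call at the bottom is left commented out because it needs network access and the site layout may have changed.

```python
import pandas as pd

BASE_URL = "http://stats.espncricinfo.com/ci/engine/records/team/match_results.html"


def yearly_results_url(year):
    """Build the match-results URL for one calendar year (same pattern as the cells above)."""
    return f"{BASE_URL}?class=2;id={year};type=year"


def scrape_yearly_results(years, out_dir):
    """Download each year's results table and save it as Results<year>.csv in out_dir."""
    for year in years:
        # pd.read_html downloads the page itself, so a separate requests.get call is not needed
        table = pd.read_html(yearly_results_url(year), match="Team 1")[0]
        table.to_csv(f"{out_dir}/Results{year}.csv")


# Example call (hypothetical output folder, requires network access):
# scrape_yearly_results(range(2015, 2020), "/path/to/Project")
```

Keeping the URL pattern in one function also makes it harder to mistype a year in one of the copies.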

Extracting top 50 totals by a team

In [13]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [14]:
import pandas as pd
highest_totals = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ HighestTotals.csv')
In [15]:
# Highest 50 totals made by any team in ODI
highest_totals
Out[15]:
Unnamed: 0 Team Score Overs RR Inns Unnamed: 5 Opposition Ground Match Date Scorecard
0 0 England 481/6 50.0 9.62 1 NaN v Australia Nottingham 19 Jun 2018 ODI # 4011
1 1 England 444/3 50.0 8.88 1 NaN v Pakistan Nottingham 30 Aug 2016 ODI # 3773
2 2 South Africa 439/2 50.0 8.78 1 NaN v West Indies Johannesburg 18 Jan 2015 ODI # 3583
3 3 South Africa 438/4 50.0 8.76 1 NaN v India Mumbai 25 Oct 2015 ODI # 3700
4 4 India 418/5 50.0 8.36 1 NaN v West Indies Indore 8 Dec 2011 ODI # 3223
5 5 England 418/6 50.0 8.36 1 NaN v West Indies St George's 27 Feb 2019 ODI # 4099
6 6 Australia 417/6 50.0 8.34 1 NaN v Afghanistan Perth 4 Mar 2015 ODI # 3623
7 7 South Africa 411/4 50.0 8.22 1 NaN v Ireland Canberra 3 Mar 2015 ODI # 3621
8 8 South Africa 408/5 50.0 8.16 1 NaN v West Indies Sydney 27 Feb 2015 ODI # 3616
9 9 England 408/9 50.0 8.16 1 NaN v New Zealand Birmingham 9 Jun 2015 ODI # 3654
10 10 India 404/5 50.0 8.08 1 NaN v Sri Lanka Kolkata 13 Nov 2014 ODI # 3544
11 11 India 401/3 50.0 8.02 1 NaN v South Africa Gwalior 24 Feb 2010 ODI # 2962
12 12 South Africa 399/6 50.0 7.98 1 NaN v Zimbabwe Benoni 22 Oct 2010 ODI # 3061
13 13 England 399/9 50.0 7.98 1 NaN v South Africa Bloemfontein 3 Feb 2016 ODI # 3732
14 14 Pakistan 399/1 50.0 7.98 1 NaN v Zimbabwe Bulawayo 20 Jul 2018 ODI # 4020
15 15 New Zealand 398/5 50.0 7.96 1 NaN v England The Oval 12 Jun 2015 ODI # 3655
16 16 England 397/6 50.0 7.94 1 NaN v Afghanistan Manchester 18 Jun 2019 ODI # 4163
17 17 New Zealand 393/6 50.0 7.86 1 NaN v West Indies Wellington 21 Mar 2015 ODI # 3643
18 18 India 392/4 50.0 7.84 1 NaN v Sri Lanka Mohali 13 Dec 2017 ODI # 3941
19 19 West Indies 389 48.0 8.10 2 NaN v England St George's 27 Feb 2019 ODI # 4099
20 20 England 386/6 50.0 7.72 1 NaN v Bangladesh Cardiff 8 Jun 2019 ODI # 4153
21 21 Pakistan 385/7 50.0 7.70 1 NaN v Bangladesh Dambulla 21 Jun 2010 ODI # 2998
22 22 South Africa 384/6 50.0 7.68 1 NaN v Sri Lanka Centurion 10 Feb 2017 ODI # 3834
23 23 India 383/6 50.0 7.66 1 NaN v Australia Bengaluru 2 Nov 2013 ODI # 3428
24 24 India 381/6 50.0 7.62 1 NaN v England Cuttack 19 Jan 2017 ODI # 3821
25 25 West Indies 381/3 50.0 7.62 1 NaN v Ireland Dublin 5 May 2019 ODI # 4128
26 26 Australia 381/5 50.0 7.62 1 NaN v Bangladesh Nottingham 20 Jun 2019 ODI # 4166
27 27 Australia 378/5 50.0 7.56 1 NaN v New Zealand Canberra 6 Dec 2016 ODI # 3812
28 28 Sri Lanka 377/8 50.0 7.54 1 NaN v Ireland Dublin (Malahide) 18 Jun 2016 ODI # 3749
29 29 India 377/5 50.0 7.54 1 NaN v West Indies Mumbai (BS) 29 Oct 2018 ODI # 4063
30 30 Australia 376/9 50.0 7.52 1 NaN v Sri Lanka Sydney 8 Mar 2015 ODI # 3629
31 31 Pakistan 375/3 50.0 7.50 1 NaN v Zimbabwe Lahore 26 May 2015 ODI # 3651
32 32 India 375/5 50.0 7.50 1 NaN v Sri Lanka Colombo (RPS) 31 Aug 2017 ODI # 3908
33 33 New Zealand 373/8 50.0 7.46 1 NaN v Zimbabwe Napier 9 Feb 2012 ODI # 3234
34 34 England 373/3 50.0 7.46 1 NaN v Pakistan Southampton 11 May 2019 ODI # 4133
35 35 South Africa 372/6 49.2 7.54 2 NaN v Australia Durban 5 Oct 2016 ODI # 3790
36 36 New Zealand 372/6 50.0 7.44 1 NaN v Zimbabwe Whangarei 6 Feb 2012 ODI # 3232
37 37 West Indies 372/2 50.0 7.44 1 NaN v Zimbabwe Canberra 24 Feb 2015 ODI # 3612
38 38 Australia 371/6 50.0 7.42 1 NaN v South Africa Durban 5 Oct 2016 ODI # 3790
39 39 Scotland 371/5 50.0 7.42 1 NaN v England Edinburgh 10 Jun 2018 ODI # 4008
40 40 New Zealand 371/7 50.0 7.42 1 NaN v Sri Lanka Mount Maunganui 3 Jan 2019 ODI # 4074
41 41 India 370/4 50.0 7.40 1 NaN v Bangladesh Dhaka 19 Feb 2011 ODI # 3100
42 42 New Zealand 369/5 50.0 7.38 1 NaN v Pakistan Napier 3 Feb 2015 ODI # 3598
43 43 Australia 369/7 50.0 7.38 1 NaN v Pakistan Adelaide 26 Jan 2017 ODI # 3826
44 44 England 369/9 50.0 7.38 1 NaN v West Indies Bristol 24 Sep 2017 ODI # 3915
45 45 South Africa 369/6 50.0 7.38 1 NaN v Bangladesh East London 22 Oct 2017 ODI # 3929
46 46 Sri Lanka 368/4 50.0 7.36 1 NaN v Pakistan Hambantota 26 Jul 2015 ODI # 3672
47 47 South Africa 367/5 50.0 7.34 1 NaN v Sri Lanka Cape Town 7 Feb 2017 ODI # 3833
48 48 England 366/8 50.0 7.32 2 NaN v India Cuttack 19 Jan 2017 ODI # 3821
49 49 Sri Lanka 366/6 50.0 7.32 1 NaN v England Colombo (RPS) 23 Oct 2018 ODI # 4058
In [16]:
# displaying teams that have scored these highest totals
highest_total_country = highest_totals.groupby('Team').count()['Score']
highest_total_country
Out[16]:
Team
Australia        6
England         10
India            9
New Zealand      6
Pakistan         3
Scotland         1
South Africa     9
Sri Lanka        3
West Indies      3
Name: Score, dtype: int64
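
The Score, Overs and RR columns in the table above are linked by simple arithmetic: an over has 6 balls, so an Overs value of 49.2 means 49 overs and 2 balls, and RR is runs per over. A minimal sketch of that relationship follows. `parse_score`, `overs_to_balls` and `run_rate` are hypothetical helpers, and the sketch assumes that a score written without a slash (such as 389) means the side was all out.

```python
def parse_score(score):
    """Split a score such as '481/6' into (runs, wickets)."""
    runs, slash, wickets = str(score).partition("/")
    # Assumption: no slash means the side was bowled out, i.e. 10 wickets fell
    return int(runs), (int(wickets) if slash else 10)


def overs_to_balls(overs):
    """Convert cricket overs notation (e.g. '49.2' = 49 overs and 2 balls) into balls."""
    whole, _, part = str(overs).partition(".")
    return int(whole) * 6 + int(part or 0)


def run_rate(runs, overs):
    """Runs scored per six-ball over."""
    return runs / (overs_to_balls(overs) / 6)


print(parse_score("481/6"))              # (481, 6)
print(overs_to_balls("49.2"))            # 296
print(round(run_rate(372, "49.2"), 2))   # 7.54
```

The last line reproduces the RR of 7.54 listed for South Africa's 372/6 in 49.2 overs against Australia.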

Creating a dataframe of results of ODI matches of the years 2015, 2016, 2017, 2018 and 2019

In [17]:
# importing the Results dataset of different years
Results_2015 = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ Results2015.csv')
Results_2016 = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ Results2016.csv')
Results_2017 = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ Results2017.csv')
Results_2018 = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ Results2018.csv')
Results_2019 = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ Results2019.csv')
In [18]:
# concatenating these dataframes
Results = pd.concat([Results_2015, Results_2016, Results_2017, Results_2018, Results_2019])
In [19]:
Results.drop(["Unnamed: 0"], axis = 1, inplace = True)
In [20]:
Results.reset_index()
Out[20]:
index Team 1 Team 2 Winner Margin Ground Match Date Scorecard
0 0 Afghanistan Scotland Afghanistan 8 wickets ICCA Dubai Jan 8, 2015 ODI # 3572
1 1 Afghanistan Ireland Ireland 3 wickets Dubai (DSC) Jan 10, 2015 ODI # 3573
2 2 New Zealand Sri Lanka New Zealand 3 wickets Christchurch Jan 11, 2015 ODI # 3574
3 3 Ireland Scotland Ireland 3 wickets Dubai (DSC) Jan 12, 2015 ODI # 3575
4 4 Afghanistan Scotland Scotland 150 runs Abu Dhabi Jan 14, 2015 ODI # 3576
5 5 New Zealand Sri Lanka Sri Lanka 6 wickets Hamilton Jan 15, 2015 ODI # 3577
6 6 Australia England Australia 3 wickets Sydney Jan 16, 2015 ODI # 3578
7 7 South Africa West Indies South Africa 61 runs Durban Jan 16, 2015 ODI # 3579
8 8 New Zealand Sri Lanka no result NaN Auckland Jan 17, 2015 ODI # 3580
9 9 Afghanistan Ireland Afghanistan 71 runs Dubai (DSC) Jan 17, 2015 ODI # 3581
10 10 Australia India Australia 4 wickets Melbourne Jan 18, 2015 ODI # 3582
11 11 South Africa West Indies South Africa 148 runs Johannesburg Jan 18, 2015 ODI # 3583
12 12 Ireland Scotland no result NaN ICCA Dubai Jan 19, 2015 ODI # 3584
13 13 New Zealand Sri Lanka New Zealand 4 wickets Nelson Jan 20, 2015 ODI # 3585
14 14 England India England 9 wickets Brisbane Jan 20, 2015 ODI # 3586
15 15 South Africa West Indies South Africa 9 wickets East London Jan 21, 2015 ODI # 3587
16 16 New Zealand Sri Lanka New Zealand 108 runs Dunedin Jan 23, 2015 ODI # 3588
17 17 Australia England Australia 3 wickets Hobart Jan 23, 2015 ODI # 3589
18 18 New Zealand Sri Lanka New Zealand 120 runs Dunedin Jan 25, 2015 ODI # 3590
19 19 South Africa West Indies West Indies 1 wicket Port Elizabeth Jan 25, 2015 ODI # 3591
20 20 Australia India no result NaN Sydney Jan 26, 2015 ODI # 3592
21 21 South Africa West Indies South Africa 131 runs Centurion Jan 28, 2015 ODI # 3593
22 22 New Zealand Sri Lanka Sri Lanka 34 runs Wellington Jan 29, 2015 ODI # 3594
23 23 England India England 3 wickets Perth Jan 30, 2015 ODI # 3595
24 24 New Zealand Pakistan New Zealand 7 wickets Wellington Jan 31, 2015 ODI # 3596
25 25 Australia England Australia 112 runs Perth Feb 1, 2015 ODI # 3597
26 26 New Zealand Pakistan New Zealand 119 runs Napier Feb 3, 2015 ODI # 3598
27 27 New Zealand Sri Lanka New Zealand 98 runs Christchurch Feb 14, 2015 ODI # 3599
28 28 Australia England Australia 111 runs Melbourne Feb 14, 2015 ODI # 3600
29 29 South Africa Zimbabwe South Africa 62 runs Hamilton Feb 15, 2015 ODI # 3601
... ... ... ... ... ... ... ... ...
617 115 Ireland Zimbabwe Ireland 6 wickets Belfast Jul 7, 2019 ODI # 4189
618 116 India New Zealand New Zealand 18 runs Manchester Jul 9-10, 2019 ODI # 4190
619 117 England Australia England 8 wickets Birmingham Jul 11, 2019 ODI # 4191
620 118 England New Zealand tied NaN Lord's Jul 14, 2019 ODI # 4192
621 119 Sri Lanka Bangladesh Sri Lanka 91 runs Colombo (RPS) Jul 26, 2019 ODI # 4193
622 120 Sri Lanka Bangladesh Sri Lanka 7 wickets Colombo (RPS) Jul 28, 2019 ODI # 4194
623 121 Sri Lanka Bangladesh Sri Lanka 122 runs Colombo (RPS) Jul 31, 2019 ODI # 4195
624 122 West Indies India no result NaN Providence Aug 8, 2019 ODI # 4196
625 123 West Indies India India 59 runs Port of Spain Aug 11, 2019 ODI # 4197
626 124 Oman P.N.G. Oman 4 wickets Aberdeen Aug 14, 2019 ODI # 4198
627 125 West Indies India India 6 wickets Port of Spain Aug 14, 2019 ODI # 4199
628 126 Scotland Oman Oman 8 wickets Aberdeen Aug 15, 2019 ODI # 4200
629 127 Scotland P.N.G. Scotland 3 wickets Aberdeen Aug 17, 2019 ODI # 4201
630 128 Scotland Oman Scotland 85 runs Aberdeen Aug 18, 2019 ODI # 4202
631 129 Scotland P.N.G. Scotland 38 runs Aberdeen Aug 20, 2019 ODI # 4203
632 130 Oman P.N.G. Oman 4 wickets Aberdeen Aug 21, 2019 ODI # 4204
633 131 U.S.A. P.N.G. U.S.A. 5 runs Lauderhill Sep 13, 2019 ODI # 4205
634 132 U.S.A. Namibia U.S.A. 5 wickets Lauderhill Sep 17, 2019 ODI # 4206
635 133 U.S.A. P.N.G. U.S.A. 62 runs Lauderhill Sep 19, 2019 ODI # 4207
636 134 U.S.A. Namibia Namibia 139 runs Lauderhill Sep 20, 2019 ODI # 4208
637 135 Namibia P.N.G. Namibia 4 wickets Lauderhill Sep 22, 2019 ODI # 4209
638 136 Namibia P.N.G. Namibia 27 runs Lauderhill Sep 23, 2019 ODI # 4210
639 137 Pakistan Sri Lanka Pakistan 67 runs Karachi Sep 30, 2019 ODI # 4211
640 138 Pakistan Sri Lanka Pakistan 5 wickets Karachi Oct 2, 2019 ODI # 4212
641 139 Afghanistan West Indies West Indies 7 wickets Lucknow Nov 6, 2019 ODI # 4213
642 140 Afghanistan West Indies West Indies 47 runs Lucknow Nov 9, 2019 ODI # 4214
643 141 Afghanistan West Indies West Indies 5 wickets Lucknow Nov 11, 2019 ODI # 4215
644 142 U.A.E. U.S.A. U.S.A. 3 wickets Sharjah Dec 8, 2019 ODI # 4216
645 143 Scotland U.S.A. U.S.A. 35 runs Sharjah Dec 9, 2019 ODI # 4217
646 144 U.A.E. U.S.A. U.S.A. 98 runs ICCA Dubai Dec 12, 2019 ODI # 4218

647 rows × 8 columns

--> We merged five results files from five different years into one consolidated dataframe. This makes it easier to access all the results at once, without referring back and forth between the files for different years.

We removed the data of the teams which were not playing the World Cup from the dataframe

In [21]:
# The ten teams that played the 2019 World Cup
world_cup_teams = ['India', 'England', 'Pakistan', 'Sri Lanka', 'Australia', 'South Africa', 'New Zealand', 'Bangladesh', 'Afghanistan', 'West Indies']

# Get the index labels of rows whose 'Team 1' is not one of these ten teams
indexNames = Results[~Results['Team 1'].isin(world_cup_teams)].index

# Delete the rows with these index labels from the dataframe
Results.drop(indexNames, inplace=True)
In [22]:
# Get the index labels of rows whose 'Team 2' is not one of these ten teams
indexNames = Results[~Results['Team 2'].isin(world_cup_teams)].index

# Delete the rows with these index labels from the dataframe
Results.drop(indexNames, inplace=True)
In [23]:
Results
Out[23]:
Team 1 Team 2 Winner Margin Ground Match Date Scorecard
5 New Zealand Sri Lanka Sri Lanka 6 wickets Hamilton Jan 15, 2015 ODI # 3577
7 South Africa West Indies South Africa 61 runs Durban Jan 16, 2015 ODI # 3579
13 New Zealand Sri Lanka New Zealand 4 wickets Nelson Jan 20, 2015 ODI # 3585
14 England India England 9 wickets Brisbane Jan 20, 2015 ODI # 3586
19 South Africa West Indies West Indies 1 wicket Port Elizabeth Jan 25, 2015 ODI # 3591
22 New Zealand Sri Lanka Sri Lanka 34 runs Wellington Jan 29, 2015 ODI # 3594
63 Afghanistan England England 9 wickets Sydney Mar 13, 2015 ODI # 3635
70 Australia Pakistan Australia 6 wickets Adelaide Mar 20, 2015 ODI # 3642
72 New Zealand South Africa New Zealand 4 wickets Auckland Mar 24, 2015 ODI # 3644
77 Bangladesh Pakistan Bangladesh 8 wickets Dhaka Apr 22, 2015 ODI # 3649
94 Bangladesh South Africa Bangladesh 7 wickets Dhaka Jul 12, 2015 ODI # 3666
96 Bangladesh South Africa Bangladesh 9 wickets Chattogram Jul 15, 2015 ODI # 3668
97 Sri Lanka Pakistan Sri Lanka 2 wickets Pallekele Jul 15, 2015 ODI # 3669
98 Sri Lanka Pakistan Pakistan 135 runs Colombo (RPS) Jul 19, 2015 ODI # 3670
99 Sri Lanka Pakistan Pakistan 7 wickets Colombo (RPS) Jul 22, 2015 ODI # 3671
100 Sri Lanka Pakistan Sri Lanka 165 runs Hambantota Jul 26, 2015 ODI # 3672
106 South Africa New Zealand South Africa 62 runs Durban Aug 26, 2015 ODI # 3678
108 England Australia Australia 59 runs Southampton Sep 3, 2015 ODI # 3680
109 England Australia Australia 64 runs Lord's Sep 5, 2015 ODI # 3681
110 England Australia England 93 runs Manchester Sep 8, 2015 ODI # 3682
112 England Australia Australia 8 wickets Manchester Sep 13, 2015 ODI # 3684
138 England Pakistan England 6 wickets Sharjah Nov 17, 2015 ODI # 3710
140 England Pakistan England 84 runs Dubai (DSC) Nov 20, 2015 ODI # 3712
145 New Zealand Sri Lanka Sri Lanka 8 wickets Nelson Dec 31, 2015 ODI # 3717
5 Australia India Australia 5 wickets Perth Jan 12, 2016 ODI # 3723
7 Australia India Australia 3 wickets Melbourne Jan 17, 2016 ODI # 3725
13 New Zealand Australia New Zealand 159 runs Auckland Feb 3, 2016 ODI # 3731
14 South Africa England England 39 runs Bloemfontein Feb 3, 2016 ODI # 3732
19 South Africa England South Africa 1 wicket Johannesburg Feb 12, 2016 ODI # 3737
22 West Indies Australia Australia 6 wickets Providence Jun 5, 2016 ODI # 3740
... ... ... ... ... ... ... ...
99 Afghanistan Bangladesh Bangladesh 3 runs Abu Dhabi Sep 23, 2018 ODI # 4045
100 Afghanistan India tied NaN Dubai (DSC) Sep 25, 2018 ODI # 4046
106 Sri Lanka England no result NaN Dambulla Oct 10, 2018 ODI # 4052
108 Sri Lanka England England 7 wickets Pallekele Oct 17, 2018 ODI # 4054
109 Sri Lanka England England 18 runs Pallekele Oct 20, 2018 ODI # 4055
110 India West Indies India 8 wickets Guwahati Oct 21, 2018 ODI # 4056
112 Sri Lanka England Sri Lanka 219 runs Colombo (RPS) Oct 23, 2018 ODI # 4058
5 Australia India India 7 wickets Melbourne Jan 18, 2019 ODI # 4079
7 South Africa Pakistan South Africa 5 wickets Durban Jan 22, 2019 ODI # 4081
13 South Africa Pakistan Pakistan 8 wickets Johannesburg Jan 27, 2019 ODI # 4087
14 New Zealand India India 7 wickets Mount Maunganui Jan 28, 2019 ODI # 4088
19 New Zealand Bangladesh New Zealand 8 wickets Napier Feb 13, 2019 ODI # 4093
22 West Indies England England 6 wickets Bridgetown Feb 20, 2019 ODI # 4096
63 Bangladesh West Indies Bangladesh 5 wickets Dublin (Malahide) May 17, 2019 ODI # 4137
70 Pakistan West Indies West Indies 7 wickets Nottingham May 31, 2019 ODI # 4144
72 Afghanistan Australia Australia 7 wickets Bristol Jun 1, 2019 ODI # 4146
77 Bangladesh New Zealand New Zealand 2 wickets The Oval Jun 5, 2019 ODI # 4151
94 England Sri Lanka Sri Lanka 20 runs Leeds Jun 21, 2019 ODI # 4168
96 New Zealand West Indies New Zealand 5 runs Manchester Jun 22, 2019 ODI # 4170
97 Pakistan South Africa Pakistan 49 runs Lord's Jun 23, 2019 ODI # 4171
98 Afghanistan Bangladesh Bangladesh 62 runs Southampton Jun 24, 2019 ODI # 4172
99 England Australia Australia 64 runs Lord's Jun 25, 2019 ODI # 4173
100 New Zealand Pakistan Pakistan 6 wickets Birmingham Jun 26, 2019 ODI # 4174
106 Sri Lanka West Indies Sri Lanka 23 runs Chester-le-Street Jul 1, 2019 ODI # 4180
108 Bangladesh India India 28 runs Birmingham Jul 2, 2019 ODI # 4182
109 England New Zealand England 119 runs Chester-le-Street Jul 3, 2019 ODI # 4183
110 Afghanistan West Indies West Indies 23 runs Leeds Jul 4, 2019 ODI # 4184
112 Bangladesh Pakistan Pakistan 94 runs Lord's Jul 5, 2019 ODI # 4186
138 Pakistan Sri Lanka Pakistan 5 wickets Karachi Oct 2, 2019 ODI # 4212
140 Afghanistan West Indies West Indies 47 runs Lucknow Nov 9, 2019 ODI # 4214

103 rows × 7 columns

Above, we have the final dataframe with the results of the teams that played in the World Cup. These are essentially the top 10 teams that qualified for the World Cup based on ICC (International Cricket Council) data. The International Cricket Council is the global governing body of cricket and conducts international matches.
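
One thing to watch in the filtering cells above: `pd.concat` keeps each year's own row labels (0, 1, 2, ... for every year), and `DataFrame.drop` removes every row that carries a given label. The repeated labels 5, 7, 13, 14, ... in Out[23] suggest that rows from all five years were dropped whenever their label matched an unwanted row from any one year. A minimal sketch of a label-independent alternative uses a boolean mask; the toy frames and the helper name `keep_world_cup_matches` are assumptions for illustration only.

```python
import pandas as pd

WORLD_CUP_TEAMS = ["India", "England", "Pakistan", "Sri Lanka", "Australia",
                   "South Africa", "New Zealand", "Bangladesh", "Afghanistan",
                   "West Indies"]


def keep_world_cup_matches(results):
    """Keep rows where both sides are World Cup teams, then renumber the rows."""
    both = results["Team 1"].isin(WORLD_CUP_TEAMS) & results["Team 2"].isin(WORLD_CUP_TEAMS)
    return results[both].reset_index(drop=True)


# Two toy "years", each with its own 0-based row labels
y2015 = pd.DataFrame({"Team 1": ["Afghanistan", "India"],
                      "Team 2": ["Scotland", "England"],
                      "Winner": ["Afghanistan", "India"]})
y2016 = pd.DataFrame({"Team 1": ["Australia", "Ireland"],
                      "Team 2": ["India", "Zimbabwe"],
                      "Winner": ["Australia", "Ireland"]})
combined = pd.concat([y2015, y2016])  # row labels are 0, 1, 0, 1

# Dropping by label also removes the valid rows that share labels 0 and 1
unwanted = combined[~combined["Team 2"].isin(WORLD_CUP_TEAMS)].index
label_based = combined.drop(unwanted)

# Filtering by mask keeps exactly the rows between World Cup teams
mask_based = keep_world_cup_matches(combined)

print(len(label_based), len(mask_based))
```

Passing `ignore_index=True` to `pd.concat` would also give every row a unique label, after which dropping by label behaves as intended.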

In [24]:
# Getting the teams with each of their number of wins
Results_winner = Results.groupby('Winner').count()['Ground']
Results_winner
Out[24]:
Winner
Australia       12
Bangladesh       8
England         21
India            8
New Zealand     11
Pakistan        13
South Africa    12
Sri Lanka        9
West Indies      6
no result        2
tied             1
Name: Ground, dtype: int64

--> The above stats show that England has won the most matches (21) in this dataframe. This makes them the favourites for the World Cup, all the more so because the World Cup is played in England.

  • Looking at just the number of victories cannot be the deciding factor. Let's gather some more insights about the grounds in England where the World Cup was held.

Extracting the Ground information and then extracting the information about the grounds in England where the World Cup was played

In [25]:
ground_averages = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/Ground_Averages.csv')
ground_averages.head()
Out[25]:
Ground Span Mat Won Tied NR Runs Wkts Balls Ave RPO
0 Eden Gardens, Kolkata - India 2013-2017 4 4 0 0 2161 72 2297 30.01 5.64
1 Feroz Shah Kotla, Delhi - India 2013-2019 4 4 0 0 1789 75 2331 23.85 4.60
2 Melbourne Cricket Ground - Australia 2013-2019 15 15 0 0 7656 217 8482 35.28 5.41
3 Saurashtra Cricket Association Stadium, Rajkot... 2013-2015 2 2 0 0 1163 26 1200 44.73 5.81
4 Adelaide Oval - Australia 2013-2019 10 10 0 0 4863 157 5645 30.97 5.16

The dataframe above contains basic information about the grounds, such as:

  • Total matches played
  • Time period of when these matches were played
  • Total runs scored and total wickets that fell and so on.
In [26]:
# Defining the grounds of England
World_Cup_Grounds=["Lord's, London - England", "The Rose Bowl, Southampton - England", "Trent Bridge, Nottingham - England", "Sophia Gardens, Cardiff - England", "Kennington Oval, London - England", "Edgbaston, Birmingham - England", "Old Trafford, Manchester - England", "Riverside Ground, Chester-le-Street - England", "Headingley, Leeds - England", "County Ground, Bristol - England", "County Ground, Taunton - England"]
In [27]:
World_Cup_Grounds
Out[27]:
["Lord's, London - England",
 'The Rose Bowl, Southampton - England',
 'Trent Bridge, Nottingham - England',
 'Sophia Gardens, Cardiff - England',
 'Kennington Oval, London - England',
 'Edgbaston, Birmingham - England',
 'Old Trafford, Manchester - England',
 'Riverside Ground, Chester-le-Street - England',
 'Headingley, Leeds - England',
 'County Ground, Bristol - England',
 'County Ground, Taunton - England']
In [28]:
Worldcup_Ground_Stats = []
England_Grounds = ground_averages.Ground
for grounds in England_Grounds:
    for venues in World_Cup_Grounds :
        if grounds in venues:
            Worldcup_Ground_Stats.append((venues))
In [29]:
Worldcup_Ground_Stats
Out[29]:
["Lord's, London - England",
 'The Rose Bowl, Southampton - England',
 'Trent Bridge, Nottingham - England',
 'Sophia Gardens, Cardiff - England',
 'Kennington Oval, London - England',
 'Edgbaston, Birmingham - England',
 'Old Trafford, Manchester - England',
 'Riverside Ground, Chester-le-Street - England',
 'Headingley, Leeds - England',
 'County Ground, Bristol - England']

--> The above loop keeps only those grounds where the World Cup was played. Our earlier dataframe contained information on all the grounds, much of which is irrelevant here.

We see here that we do not have data for "County Ground, Taunton - England" in the "ground_averages" dataframe, so we do not include it in our analysis.
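
The same venue filter can be written without nested loops. The sketch below uses exact name matching with `isin` (the loop above uses substring matching) and reports which venues have no row. `split_venues` is a hypothetical helper, and the three-row frame only mimics the shape of `ground_averages`.

```python
import pandas as pd


def split_venues(ground_averages, venues):
    """Return the rows for the listed venues and the venue names that have no row."""
    available = ground_averages[ground_averages["Ground"].isin(venues)]
    missing = sorted(set(venues) - set(available["Ground"]))
    return available, missing


# Toy stand-in for ground_averages (three grounds, one column of stats)
toy_grounds = pd.DataFrame({
    "Ground": ["Eden Gardens, Kolkata - India",
               "Lord's, London - England",
               "County Ground, Bristol - England"],
    "Mat": [4, 7, 4],
})
toy_venues = ["Lord's, London - England",
              "County Ground, Bristol - England",
              "County Ground, Taunton - England"]

available, missing = split_venues(toy_grounds, toy_venues)
print(missing)
```

Listing the missing venues explicitly makes gaps such as Taunton visible without comparing the two lists by eye.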

In [30]:
WorldCup_Grounds_Stats = ground_averages[ground_averages.Ground.isin(Worldcup_Ground_Stats)]
In [31]:
WorldCup_Grounds_Stats
Out[31]:
Ground Span Mat Won Tied NR Runs Wkts Balls Ave RPO
34 Lord's, London - England 2013-2018 7 7 0 0 3549 110 3851 32.26 5.52
35 The Rose Bowl, Southampton - England 2013-2019 8 8 0 0 4766 103 4522 46.27 6.32
36 Trent Bridge, Nottingham - England 2013-2019 9 7 1 1 4944 112 4530 44.14 6.54
37 Sophia Gardens, Cardiff - England 2013-2018 14 13 1 0 6690 221 7264 30.27 5.52
38 Kennington Oval, London - England 2013-2019 17 15 0 2 8038 208 8369 38.64 5.76
39 Edgbaston, Birmingham - England 2013-2017 15 12 0 3 5950 182 6654 32.69 5.36
50 Old Trafford, Manchester - England 2013-2018 6 6 0 0 2294 87 2627 26.36 5.23
73 Riverside Ground, Chester-le-Street - England 2014-2018 3 3 0 0 1454 46 1475 31.60 5.91
75 Headingley, Leeds - England 2014-2019 6 5 0 0 3262 83 3351 39.30 5.84
90 County Ground, Bristol - England 2016-2019 4 3 0 1 1848 55 1746 33.60 6.35

The grounds in the dataframe above are the venues that hosted the ICC Cricket World Cup 2019 matches, apart from "County Ground, Taunton - England".

Extracting match results table

In [32]:
match_results = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/Match_Results.csv')
match_results.head()
Out[32]:
Unnamed: 0 Team Span Mat Won Lost Tied NR %
0 0 Afghanistan 2009-2019 126 59 63 1 3 48.37
1 1 Africa XI 2005-2007 6 1 4 0 1 20.00
2 2 Asia XI 2005-2007 7 4 2 0 1 66.66
3 3 Australia 1971-2019 942 573 326 9 34 63.60
4 4 Bangladesh 1986-2019 373 125 241 0 7 34.15
In [33]:
# Keep only the ten teams that played the 2019 World Cup
match_results.rename(columns = {'Country':'Team'}, inplace = True)
world_cup_teams = ['India', 'England', 'Pakistan', 'Sri Lanka', 'Australia', 'South Africa', 'New Zealand', 'Bangladesh', 'Afghanistan', 'West Indies']
# Get the index labels of rows whose team is not one of these ten teams
indexNames = match_results[~match_results['Team'].isin(world_cup_teams)].index

# Delete the rows with these index labels from the dataframe
match_results.drop(indexNames, inplace=True)
In [34]:
match_results
Out[34]:
Unnamed: 0 Team Span Mat Won Lost Tied NR %
0 0 Afghanistan 2009-2019 126 59 63 1 3 48.37
3 3 Australia 1971-2019 942 573 326 9 34 63.60
4 4 Bangladesh 1986-2019 373 125 241 0 7 34.15
8 8 England 1971-2019 743 374 333 9 27 52.86
11 11 India 1974-2019 978 509 419 9 41 54.80
17 17 New Zealand 1973-2019 768 348 373 7 40 48.28
19 19 Pakistan 1973-2019 927 486 413 8 20 54.02
22 22 South Africa 1991-2019 619 381 215 6 17 63.78
23 23 Sri Lanka 1975-2019 849 386 421 5 37 47.84
26 26 West Indies 1973-2019 813 397 376 10 30 51.34

The data above shows the ten teams of interest and their match stats, such as the number of matches won, lost, and tied.

In [35]:
# creating a win percentage column, since the number of matches played differs from team to team
match_results['Win Percent'] = (match_results['Won']/match_results['Mat'])*100
match_results
Out[35]:
Unnamed: 0 Team Span Mat Won Lost Tied NR % Win Percent
0 0 Afghanistan 2009-2019 126 59 63 1 3 48.37 46.825397
3 3 Australia 1971-2019 942 573 326 9 34 63.60 60.828025
4 4 Bangladesh 1986-2019 373 125 241 0 7 34.15 33.512064
8 8 England 1971-2019 743 374 333 9 27 52.86 50.336474
11 11 India 1974-2019 978 509 419 9 41 54.80 52.044990
17 17 New Zealand 1973-2019 768 348 373 7 40 48.28 45.312500
19 19 Pakistan 1973-2019 927 486 413 8 20 54.02 52.427184
22 22 South Africa 1991-2019 619 381 215 6 17 63.78 61.550889
23 23 Sri Lanka 1975-2019 849 386 421 5 37 47.84 45.465253
26 26 West Indies 1973-2019 813 397 376 10 30 51.34 48.831488
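
The `%` column that comes with the source table and the `Win Percent` column computed above differ for every team. Judging only from the numbers shown, `%` appears to count a tie as half a win and to leave no-result matches out of the denominator, while `Win Percent` divides wins by all matches played. The sketch below checks that reading against two rows of the table. The function names are ours, and the formula for `%` is an inference from the figures, not something the data source states here.

```python
def win_percent(won, mat):
    """Wins as a share of all matches played (the 'Win Percent' column)."""
    return won / mat * 100


def source_percent(won, tied, nr, mat):
    """Assumed definition of the source '%' column:
    a tie counts as half a win and no-result matches are excluded."""
    return (won + 0.5 * tied) / (mat - nr) * 100


# Rows copied from the table above: (Won, Tied, NR, Mat)
rows = {
    "Afghanistan": (59, 1, 3, 126),
    "Australia": (573, 9, 34, 942),
}

for team, (won, tied, nr, mat) in rows.items():
    print(team,
          round(source_percent(won, tied, nr, mat), 2),
          round(win_percent(won, mat), 2))
```

Either definition is usable for ranking; the important point is to use the same one for every team.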

Section3: Visualizations

Grounds Data Analysis

In [36]:
import matplotlib
# ax = match_results.plot.bar(x='Team', y='Win Percent')
sns.barplot(x = "Team", y = "Win Percent", data = match_results).set_title("Win Percent of each Country")
plt.xlabel("Team")
plt.ylabel("Win Percent")
plt.xticks(rotation = 90)
Out[36]:
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), <a list of 10 Text xticklabel objects>)

Fig1.

--> We can observe from the above graph,

  • That across past matches, South Africa has the highest win percentage (about 61.6%), followed by Australia (about 60.8%) and Pakistan (about 52.4%).
  • This means that South Africa has historically been a strong team that wins most of its matches. Australia is second and Pakistan third, although India (about 52.0%) is less than half a percentage point behind Pakistan. England (about 50.3%) is not among the top three teams based on win percentage. Neither is India.
  • From this graph alone, South Africa would look like the favourite to win the cup. However, let's analyse a little more.
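
The ranking read off Fig1 can also be checked numerically. The sketch below is a minimal example that re-enters the Win Percent values from the table above (rounded to two decimals) and sorts them; the variable names are ours.

```python
import pandas as pd

# Win Percent values copied from the table above, rounded to two decimals
win_percent = pd.Series({
    "Afghanistan": 46.83, "Australia": 60.83, "Bangladesh": 33.51,
    "England": 50.34, "India": 52.04, "New Zealand": 45.31,
    "Pakistan": 52.43, "South Africa": 61.55, "Sri Lanka": 45.47,
    "West Indies": 48.83,
})

# Highest win percentage first
ranking = win_percent.sort_values(ascending=False)
top_three = list(ranking.index[:3])

# Gap between third and fourth place, in percentage points
gap_pak_ind = ranking["Pakistan"] - ranking["India"]
print(top_three, round(gap_pak_ind, 2))
```

The gap between Pakistan and India is small, so third place in this ranking should not be over-interpreted.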
In [37]:
ODI_Scores_Data  = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ODI_Match_Results.csv')
In [38]:
WC_venue_pitches = ["The Oval, London","Trent Bridge, Nottingham","Sophia Gardens, Cardiff","County Ground, Bristol","Rose Bowl, Southampton","County Ground, Taunton","Old Trafford, Manchester","Edgbaston, Birmingham","Headingley, Leeds","Lord's, London","Riverside Ground, Chester-le-Street"]
In [39]:
#Total Grounds
WC_Ground_Stats = []
ODI_Grounds = ODI_Scores_Data.Ground
for i in ODI_Grounds:
    for j in WC_venue_pitches:
        if i in j:
            WC_Ground_Stats.append((i,j))
In [40]:
Ground_names = dict(set(WC_Ground_Stats))
def Full_Ground_names(value):
    return Ground_names[value]
Ground_names
Out[40]:
{'Cardiff': 'Sophia Gardens, Cardiff',
 'Bristol': 'County Ground, Bristol',
 'Southampton': 'Rose Bowl, Southampton',
 "Lord's": "Lord's, London",
 'Leeds': 'Headingley, Leeds',
 'Manchester': 'Old Trafford, Manchester',
 'Nottingham': 'Trent Bridge, Nottingham',
 'The Oval': 'The Oval, London',
 'Birmingham': 'Edgbaston, Birmingham',
 'Chester-le-Street': 'Riverside Ground, Chester-le-Street'}
In [41]:
#Let's gather the data of all ODIs played at these WC venues
# .copy() gives an independent DataFrame, so the column assignment below does not raise SettingWithCopyWarning
WC_England_History = ODI_Scores_Data[ODI_Scores_Data.Ground.isin([Ground[0] for Ground in WC_Ground_Stats])].copy()
WC_England_History["Ground"] = WC_England_History.Ground.apply(Full_Ground_names)
WC_England_History.head()
Out[41]:
Unnamed: 0 Result Margin BR Toss Bat Opposition Ground Start Date Match_ID Country Country_ID
75 566 won 5 wickets 19.0 won 2nd v England Lord's, London 31 May 2013 ODI # 3360 Newzealad 5
76 860 lost 5 wickets 19.0 lost 1st v New Zealand Lord's, London 31 May 2013 ODI # 3360 England 1
77 567 won 86 runs NaN won 1st v England Rose Bowl, Southampton 2 Jun 2013 ODI # 3361 Newzealad 5
78 861 lost 86 runs NaN lost 2nd v New Zealand Rose Bowl, Southampton 2 Jun 2013 ODI # 3361 England 1
79 568 lost 34 runs NaN won 2nd v England Trent Bridge, Nottingham 5 Jun 2013 ODI # 3362 Newzealad 5
In [42]:
# .copy() avoids SettingWithCopyWarning when the helper column is added
winnings = WC_England_History[["Country","Result"]].copy()
winnings["count"] = 1
# Count the results per team, then convert each count to a percentage of that team's matches
Ground_Results_Per_Team = winnings.groupby(["Country","Result"]).aggregate(["sum"])
Ground_Results_Per_Team = Ground_Results_Per_Team.groupby(level=0).apply(lambda x:100 * x / float(x.sum())).reset_index()
Ground_Results_Per_Team.columns = ["Country","Result","Count"]
Ground_Results_Per_Team.head()
Out[42]:
Country Result Count
0 Australia aban 4.761905
1 Australia lost 52.380952
2 Australia n/r 19.047619
3 Australia won 23.809524
4 Bangladesh lost 50.000000
In [43]:
import plotly.graph_objects as go
# One bar trace per result label; each trace plots that result's share (in percent) for every country.
# The original filtered x on 'abandoned' but y on 'aban'; the label used in the data is 'aban'.
result_labels = ['aban', 'lost', 'n/r', 'won', '-', 'tied']
fig = go.Figure(data=[
    go.Bar(name=label,
           x=Ground_Results_Per_Team[Ground_Results_Per_Team.Result == label].Country,
           y=Ground_Results_Per_Team[Ground_Results_Per_Team.Result == label].Count)
    for label in result_labels
])
# Change the bar mode; the Count column holds percentages, so label the axis accordingly
fig.update_layout(barmode='group', title = 'Details of Matches - Country Wise',xaxis_title="Country",yaxis_title="Percentage Of Matches",)
fig.show()

Fig 2

--> The above interactive graph compares, for each team, the percentage of its matches in England that were won, lost or tied. It also shows the percentage of matches that were abandoned or had no result.

  • This gives us a clear comparison of the teams' performances in England.
  • The graph tells us that, after India, England has the highest share of matches won. This differs from the result shown in Fig1 because Fig2 covers only matches played in England. Clearly, England performs better on its home grounds (home advantage).
  • This also answers one of our questions: does the home ground affect a team's performance? Yes, it does.

--> South Africa is not shown in a good light here. They have lost more matches in England than they have won, in contrast to what we saw in Fig1. Hence, England and India now look like the major contenders for the World Cup.
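
The per-team percentages behind Fig 2 can also be produced in one step with a row-normalised cross-tabulation. The sketch below is a minimal example on made-up rows (the `toy_history` frame and its values are illustrative only, not taken from the data set); it assumes one row per team per match, as in `WC_England_History`.

```python
import pandas as pd

# Toy match rows (one row per team per match); values are illustrative only
toy_history = pd.DataFrame({
    "Country": ["England", "England", "England", "England", "India", "India"],
    "Result":  ["won", "won", "lost", "n/r", "won", "lost"],
})

# Row-normalised cross-tabulation: each country's results add up to 100
result_share = pd.crosstab(toy_history["Country"], toy_history["Result"],
                           normalize="index") * 100
print(result_share.round(1))
```

Because every row sums to 100, teams with very different numbers of matches in England can be compared directly.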

Now let's see whether it is advantageous to bat first or bowl first

In [44]:
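# Share of wins by batting order, in percent.
# rename_axis/reset_index(name=...) keeps the column names the same across pandas versions
Inning_Wins = WC_England_History[WC_England_History.Result == "won"].Bat.value_counts(normalize = True).mul(100).rename_axis("index").reset_index(name = "Bat")
sns.barplot(x = "index", y = "Bat", data = Inning_Wins).set_title("Wins by Innings")
plt.xlabel("Innings")
plt.ylabel("Win Percentage")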
Out[44]:
Text(0,0.5,'Win Percentage')

Fig 3.

--> Fig3 shows that the team batting 2nd wins more matches than the team batting 1st.

  • This means that the chasing team has a better chance of winning the match. One likely reason is that the team batting second already knows the target and can plan its innings to reach that score efficiently. It has also watched the first innings and has an idea of how the pitch is behaving.
  • The team batting first, on the other hand, knows less about the condition of the pitch. It might take some overs to work out the best approach: whether to hit boundaries or keep taking singles, whether the pitch is damp or dry, and how the ball will bounce and swing.

Note: The period in which one team bats and the other bowls is called an innings. In an ODI the first innings lasts up to 50 overs; the teams then swap roles, and the second team bats while the first one bowls. This is the second innings.

In [45]:
Pitch_Innings = WC_England_History[WC_England_History.Result == "won"][["Bat","Ground"]]
Pitch_Innings["Count"] = 1
Pitch_Innings = Pitch_Innings.groupby(["Ground","Bat"]).sum()
Pitch_Innings = Pitch_Innings.groupby(level=0).apply(lambda x:100 * x / float(x.sum())).reset_index()
Pitch_Innings.columns = ["Ground", "Bat","Wins"]
Pitch_Innings.head( 5 )
Out[45]:
Ground Bat Wins
0 County Ground, Bristol 1st 33.333333
1 County Ground, Bristol 2nd 66.666667
2 Edgbaston, Birmingham 1st 41.666667
3 Edgbaston, Birmingham 2nd 58.333333
4 Headingley, Leeds 1st 40.000000
In [46]:
plt.figure(figsize=(15,8))
g = sns.lineplot( x = "Ground", y = "Wins", hue = "Bat", style="Bat", markers=True, data = Pitch_Innings)
plt.xticks(rotation = 60)
plt.title('Win - Batting 1st or 2nd')
Out[46]:
Text(0.5,1,'Win - Batting 1st or 2nd')

Fig 4

--> The above graph shows,

  • that at some grounds the team batting first wins more often, while at others the team batting second does. For example, at County Ground, Bristol, the team batting second has won two thirds of the matches, whereas at Lord's the team batting first often wins.
  • This supports the finding in Fig3: at most of the grounds, the team batting second wins more often.
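
The per-ground percentages used for Fig 4 can also be computed with `groupby(...).transform("sum")`, which avoids the `apply` with a lambda. The sketch below is a minimal example: the ground names are real, but the rows in `toy_wins` are illustrative and not taken from the data set.

```python
import pandas as pd

# Toy rows of winning sides only; counts are illustrative
toy_wins = pd.DataFrame({
    "Ground": ["Lord's, London"] * 4 + ["County Ground, Bristol"] * 3,
    "Bat":    ["1st", "1st", "1st", "2nd", "2nd", "2nd", "1st"],
})

# Count wins per (Ground, Bat) pair
counts = toy_wins.groupby(["Ground", "Bat"]).size().reset_index(name="Wins")

# Divide each count by the total number of wins at that ground
ground_totals = counts.groupby("Ground")["Wins"].transform("sum")
counts["Wins"] = 100 * counts["Wins"] / ground_totals
print(counts)
```

`transform("sum")` returns one total per row, aligned with the original rows, so the division needs no reshaping.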

Now we will analyse the effect of winning or losing the toss on match results

In [47]:
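# Share of wins by toss outcome, in percent.
# rename_axis/reset_index(name=...) keeps the column names the same across pandas versions
Inning_Wins = WC_England_History[WC_England_History.Result == "won"].Toss.value_counts(normalize = True).mul(100).rename_axis("index").reset_index(name = "Toss")
sns.barplot(x = "index", y = "Toss", data = Inning_Wins).set_title("Wins by Toss")
plt.xlabel("Toss")
plt.ylabel("Win Percentage")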
Out[47]:
Text(0,0.5,'Win Percentage')

Fig 5a.

--> The above graph shows a surprising finding.

  • It is usually said that the toss is one of the most important factors deciding the outcome of a game. People believe that a team that wins the toss has a high chance of winning the match, as it gets to choose whether to bat or bowl first.
  • But the graph above shows that the team losing the toss has won more matches, which contradicts the usual belief.
  • This suggests that, on the grounds in England, winning the toss is not a decisive advantage. Other factors matter more to the result of the game.
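
Before reading too much into a gap between toss winners and toss losers, it may help to ask how likely such a gap is if the toss had no effect at all. The sketch below is one possible check, an exact two-sided binomial test written with the standard library only. It is not part of the original analysis, and the counts passed to it are hypothetical, not taken from the data set.

```python
from math import comb


def binomial_two_sided_p(successes, trials, p=0.5):
    """Exact two-sided binomial test: total probability of all outcomes
    that are no more likely than the observed one under the null."""
    pmf = [comb(trials, k) * p**k * (1 - p)**(trials - k)
           for k in range(trials + 1)]
    observed = pmf[successes]
    # Small tolerance so outcomes tied with the observed probability are included
    total = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical counts: 55 of 100 decided matches won by the side that lost the toss
p_value = binomial_two_sided_p(55, 100)
print(round(p_value, 3))
```

A large p-value would mean the observed split is compatible with the toss having no effect, which fits the reading that the toss does not decide matches on these grounds.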
In [48]:
Pitch_Innings = WC_England_History[WC_England_History.Result == "won"][["Toss","Ground"]]
Pitch_Innings["Count"] = 1
Pitch_Innings = Pitch_Innings.groupby(["Ground","Toss"]).sum()
Pitch_Innings = Pitch_Innings.groupby(level=0).apply(lambda x:100 * x / float(x.sum())).reset_index()
Pitch_Innings.columns = ["Ground", "Toss","Wins"]
Pitch_Innings.head( 5 )
Out[48]:
Ground Toss Wins
0 County Ground, Bristol lost 66.666667
1 County Ground, Bristol won 33.333333
2 Edgbaston, Birmingham lost 50.000000
3 Edgbaston, Birmingham won 50.000000
4 Headingley, Leeds lost 80.000000
In [49]:
plt.figure(figsize=(15,8))
sns.barplot(x = "Ground", y = "Wins", hue = "Toss", data = Pitch_Innings).set_title("Results - Based on Toss")
plt.xticks(rotation = 60)
Out[49]:
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), <a list of 10 Text xticklabel objects>)

Fig 5b.

--> This is a continuation of Fig 5a.

  • Here we see that 6 of the 10 grounds favour the team that loses the toss.
  • Only at The Oval and Trent Bridge does the team winning the toss win the match more often.
  • If any one of us were the captain of a team, we would not mind losing the toss at those six grounds, since past results suggest a higher chance of winning the match.
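
The count of grounds that favour the toss loser can be obtained by reshaping the long table into one row per ground. The sketch below uses the first rows of the table above; Headingley's 'won' share is not shown there and is filled in as 100 - 80 = 20, which is our assumption.

```python
import pandas as pd

# First rows of the table above; the last value (20.0) is implied, not shown
toss_share = pd.DataFrame({
    "Ground": ["County Ground, Bristol"] * 2
              + ["Edgbaston, Birmingham"] * 2
              + ["Headingley, Leeds"] * 2,
    "Toss":   ["lost", "won"] * 3,
    "Wins":   [66.666667, 33.333333, 50.0, 50.0, 80.0, 20.0],
})

# One row per ground, one column per toss outcome
wide = toss_share.pivot(index="Ground", columns="Toss", values="Wins")

# Grounds where the side that lost the toss has won more often
favours_toss_loser = wide[wide["lost"] > wide["won"]].index.tolist()
print(favours_toss_loser)
```

Grounds with an even split, such as Edgbaston here, fall into neither group.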

Let's see the distribution of scores at the different grounds in England

In [50]:
ODI_Scores  = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ODI_Match_Totals.csv')
In [51]:
#Let's gather the data of all ODIs played at these WC venues
# .copy() gives an independent DataFrame, so the column assignment below does not raise SettingWithCopyWarning
WC_England_Scores = ODI_Scores[ODI_Scores.Ground.isin([Ground[0] for Ground in WC_Ground_Stats])].copy()
WC_England_Scores["Ground"] = WC_England_Scores.Ground.apply(Full_Ground_names)
WC_England_Scores.head()
Out[51]:
Unnamed: 0 Score Overs RPO Target Inns Result Opposition Ground Start Date Match_ID Country Country_ID
74 557 231/5 46.5 4.93 228.0 2 won v England Lord's, London 31 May 2013 ODI # 3360 Newzealad 5
75 844 227/9 50.0 4.54 NaN 1 lost v New Zealand Lord's, London 31 May 2013 ODI # 3360 England 1
76 558 359/3 50.0 7.18 NaN 1 won v England Rose Bowl, Southampton 2 Jun 2013 ODI # 3361 Newzealad 5
77 845 273 44.1 6.18 360.0 2 lost v New Zealand Rose Bowl, Southampton 2 Jun 2013 ODI # 3361 England 1
78 559 253 46.3 5.44 288.0 2 lost v England Trent Bridge, Nottingham 5 Jun 2013 ODI # 3362 Newzealad 5
In [52]:
WC_England_Scores = WC_England_Scores[~WC_England_Scores.Score.str.contains("D")]
In [53]:
Scores = [int(item[0]) for item in WC_England_Scores.Score.str.split("/")]
WC_England_Scores["Score_without_wickets"] = Scores
Stadium_Scores = WC_England_Scores[["Score_without_wickets","Ground"]]
Stadium_Scores = Stadium_Scores[Stadium_Scores.Score_without_wickets > 50]
plt.figure(figsize=(12,6))
plt.xticks(rotation = 60)
sns.violinplot(x = "Ground", y = "Score_without_wickets",data = Stadium_Scores).set_title("Scores vs Pitches")
plt.ylabel("Scores")
Out[53]:
Text(0,0.5,'Scores')

Fig 6.

--> The above plot shows the distribution of scores at the different grounds.

  • The white dot in the middle marks the median score (by default seaborn draws a small box plot inside each violin).
  • The thick bar around the dot spans the interquartile range. The length of the violin shows the overall range of scores, and its width at any height shows how common scores around that value are.
  • At Rose Bowl, Headingley and Lord's, the scores are fairly consistent, with little spread.
  • But at The Oval, Old Trafford, Edgbaston and County Ground, the range of scores is very wide, so scores there are harder to predict.
  • This would suggest that at Lord's a score of 250 runs or more would be hard to chase down, so the team posting it has a good chance of victory.
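
The quantities a violin plot summarises can also be computed directly. The sketch below is a minimal example on a small made-up frame: the ground names are real and the `Score` strings follow the same "runs/wickets" format as the data, but the set of scores is illustrative. It assumes seaborn's default inner box plot, where the dot is the median and the thick bar is the interquartile range.

```python
import pandas as pd

# Toy innings totals in the "runs/wickets" text format of the Score column
toy_scores = pd.DataFrame({
    "Ground": ["Lord's, London"] * 4 + ["The Oval, London"] * 4,
    "Score":  ["231/5", "227/9", "250", "240/7", "150", "359/3", "273", "310/6"],
})

# Keep the part before "/" and convert it to an integer
toy_scores["Runs"] = toy_scores["Score"].str.split("/").str[0].astype(int)

by_ground = toy_scores.groupby("Ground")["Runs"]
summary = by_ground.agg(["median", "std", "min", "max"])

# Interquartile range: the span covered by the thick bar inside each violin
summary["iqr"] = by_ground.quantile(0.75) - by_ground.quantile(0.25)
print(summary)
```

A ground with a small interquartile range and standard deviation is one where a "par" score is easier to define.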

Let's find out which team has won most often at each ground

In [54]:
Grounds = WC_England_Scores.Ground.unique()
WC_Teams = WC_England_Scores.Country.unique()
Ground_Winnings = {}
for Ground in Grounds:
    Ground_Winnings.update({Ground : {}})
    for Team in WC_Teams:
        Country_Ground_Record = WC_England_Scores[(WC_England_Scores.Country == Team) & \
                                                   (WC_England_Scores.Ground == Ground)]
        matches_played = len(Country_Ground_Record)
        if matches_played == 0:
            continue
        matches_won = len(Country_Ground_Record[Country_Ground_Record.Result == "won"])
        winning_percentage = matches_won / matches_played * 100
        Ground_Winnings[Ground].update({Team : {"matches_played" : matches_played,\
                                       "matches_won": matches_won,\
                                       "winning_percentage" : winning_percentage}})
In [55]:
Data_Frame_Data = []
for Pitch, P_Data in Ground_Winnings.items():
    
    for Team, Team_Data in P_Data.items():
        inside = []
        inside.extend([Pitch,Team,Team_Data["matches_played"],\
                       Team_Data["matches_won"],Team_Data["winning_percentage"]])
        Data_Frame_Data.append(inside)
In [56]:
Columns = ["Ground", "Country","Played","Won","Win_Percentage"]
Data_Frame_Data
Pitch_Team_Winnings = pd.DataFrame(Data_Frame_Data, columns=Columns)
In [57]:
Pitch_Team_Winnings.groupby(['Ground','Country']).mean()
Out[57]:
Played Won Win_Percentage
Ground Country
County Ground, Bristol England 4 3 75.000000
Pakistan 1 0 0.000000
SriLanka 1 0 0.000000
WestIndies 1 0 0.000000
Edgbaston, Birmingham Australia 4 0 0.000000
Bangladesh 1 0 0.000000
England 8 4 50.000000
India 5 5 100.000000
Newzealad 3 0 0.000000
Pakistan 4 1 25.000000
SouthAfrica 2 1 50.000000
SriLanka 2 1 50.000000
Headingley, Leeds Australia 1 0 0.000000
England 6 5 83.333333
India 2 0 0.000000
Pakistan 2 0 0.000000
SouthAfrica 1 0 0.000000
Lord's, London Australia 1 1 100.000000
England 7 3 42.857143
India 1 0 0.000000
Newzealad 1 1 100.000000
Pakistan 1 0 0.000000
SouthAfrica 1 1 100.000000
SriLanka 1 1 100.000000
Old Trafford, Manchester Australia 4 2 50.000000
England 6 4 66.666667
SriLanka 1 0 0.000000
WestIndies 1 0 0.000000
Riverside Ground, Chester-le-Street Australia 1 0 0.000000
England 3 2 66.666667
... ... ... ... ...
Rose Bowl, Southampton Australia 2 2 100.000000
England 8 4 50.000000
Newzealad 2 2 100.000000
Pakistan 2 0 0.000000
SouthAfrica 1 0 0.000000
WestIndies 1 0 0.000000
Sophia Gardens, Cardiff Australia 2 0 0.000000
Bangladesh 1 1 100.000000
England 8 5 62.500000
India 3 3 100.000000
Newzealad 4 1 25.000000
Pakistan 3 3 100.000000
SouthAfrica 2 0 0.000000
SriLanka 4 0 0.000000
WestIndies 1 0 0.000000
The Oval, London Australia 3 0 0.000000
Bangladesh 2 0 0.000000
England 8 6 75.000000
India 4 2 50.000000
Newzealad 1 1 100.000000
Pakistan 3 1 33.333333
SouthAfrica 3 1 33.333333
SriLanka 6 3 50.000000
WestIndies 3 1 33.333333
Trent Bridge, Nottingham Australia 1 0 0.000000
England 9 5 55.555556
India 2 2 100.000000
Newzealad 2 0 0.000000
Pakistan 2 0 0.000000
SriLanka 1 0 0.000000

62 rows × 3 columns

--> From the DataFrame above,

  • We see that India and England have the strongest records across the different venues.
  • One or two other teams show up with very high percentages, but only because they have played very few matches at those grounds (often just one), which inflates the percentage.
  • This gives us a good indication that England and India are the favourites for the World Cup 2019.
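
The same played/won/percentage table can be built with a single `groupby`, and a minimum-matches filter can keep one-match records from dominating the picture. The sketch below is a minimal example on made-up rows; `MIN_MATCHES` is a hypothetical threshold chosen only for illustration.

```python
import pandas as pd

# Toy records (one row per team per match); values are illustrative only
toy_records = pd.DataFrame({
    "Ground":  ["Edgbaston, Birmingham"] * 5 + ["Lord's, London"] * 3,
    "Country": ["India", "India", "England", "England", "Australia",
                "England", "England", "Australia"],
    "Result":  ["won", "won", "won", "lost", "lost", "won", "lost", "won"],
})

summary = (
    toy_records
    .assign(Won=toy_records["Result"].eq("won"))       # True where the team won
    .groupby(["Ground", "Country"])
    .agg(Played=("Won", "size"), Won=("Won", "sum"))   # matches played and won
    .reset_index()
)
summary["Win_Percentage"] = 100 * summary["Won"] / summary["Played"]

# Hypothetical threshold: ignore team/ground pairs with fewer than 2 matches
MIN_MATCHES = 2
reliable = summary[summary["Played"] >= MIN_MATCHES]
print(reliable)
```

With the real data a higher threshold would probably be more sensible; the right value is a judgment call.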

Analysis of total matches and the number of matches won, lost and tied

In [58]:
import numpy as np
import matplotlib.pyplot as plt
category_names = [ 'lost', 'aban', 'tied', 'won']
results = {
    'Australia': [52.380952, 4.761905, 0, 23.809524],
    'Bangladesh': [50.000000, 0, 0, 25.000000],
    'England': [30.000000, 2.857143, 1.428571, 58.571429],
    'India': [27.777778, 5.555556, 0, 66.666667],
    'Newzealad': [50.000000, 0, 0, 35.714286],
    'Pakistan': [61.111111, 0, 0, 27.777778],
    'SouthAfrica':[60.000000,0,10.000000, 30.000000],
    'SriLanka':[52.941176, 0, 5.882353, 35.294118],
    'WestIndies':[62.500000, 0, 12.500000, 12.500000]
}


def survey(results, category_names):

    labels = list(results.keys())
    data = np.array(list(results.values()))
    data_cum = data.cumsum(axis=1)
    category_colors = plt.get_cmap('RdYlGn')(
        np.linspace(0.15, 0.85, data.shape[1]))

    fig, ax = plt.subplots(figsize=(9.2, 5))
    ax.invert_yaxis()
    ax.xaxis.set_visible(False)
    ax.set_xlim(0, np.sum(data, axis=1).max())

    for i, (colname, color) in enumerate(zip(category_names, category_colors)):
        widths = data[:, i]
        starts = data_cum[:, i] - widths
        ax.barh(labels, widths, left=starts, height=0.5,
                label=colname, color=color)
        xcenters = starts + widths / 2

        r, g, b, _ = color
        text_color = 'white' if r * g * b < 0.5 else 'darkgrey'
        for y, (x, c) in enumerate(zip(xcenters, widths)):
            ax.text(x, y, str(int(c)), ha='center', va='center',
                    color=text_color)
    ax.legend(ncol=len(category_names), bbox_to_anchor=(0, 1),
              loc='lower left', fontsize='small')

    return fig, ax


survey(results, category_names)
plt.title('Status & Number of matches/Team',loc = 'right')
plt.show()

Fig7.

--> The above graph shows that,

  • in the last few years Pakistan, West Indies and South Africa have not performed well in England, so they have a lower chance of winning.
  • Similarly, India and England have been doing very well, so they are more likely to win matches in the World Cup.

Also, as mentioned for Fig 2, England and India are evenly matched teams. India performs slightly better on these stats, but England will be playing at home, which gives them an edge over India.

However, this data alone is not enough to pick the winner of the cup. We will now analyse players' data to draw some more conclusions.
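
The `results` dictionary used for Fig7 was typed in by hand. As a sketch of an alternative, it could be derived from a long Country/Result/Count table like the one computed earlier. The example below re-enters only the Australia rows shown in that table; the variable names `long_table` and `wide` are ours.

```python
import pandas as pd

# Rows copied from the percentage table shown earlier (Australia only)
long_table = pd.DataFrame({
    "Country": ["Australia"] * 4,
    "Result":  ["aban", "lost", "n/r", "won"],
    "Count":   [4.761905, 52.380952, 19.047619, 23.809524],
})

category_names = ["lost", "aban", "tied", "won"]

wide = (
    long_table
    .pivot(index="Country", columns="Result", values="Count")
    .reindex(columns=category_names)  # fixes the column order; missing results become NaN
    .fillna(0)
)

# Same shape as the hand-written dictionary: country -> list of percentages
results = {country: row.tolist() for country, row in wide.iterrows()}
print(results)
```

As in the hand-written version, the 'n/r' share is left out, so a row does not have to add up to 100.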

Batsmen Data Analysis

In [59]:
Batsman = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/Batsman_Data.csv')
In [60]:
ground_Data = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/Ground_Averages.csv')
odi_Scores_Data = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ODI_Match_Totals.csv')
odi_Results_Data = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ODI_Match_Results.csv')
wc_Players_Data = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/WC_players.csv')
bowler = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/Bowler_data.csv')
In [61]:
WC_venue_pitches = ["The Oval, London","Trent Bridge, Nottingham","Sophia Gardens, Cardiff","County Ground, Bristol","Rose Bowl, Southampton","County Ground, Taunton","Old Trafford, Manchester","Edgbaston, Birmingham","Headingley, Leeds","Lord's, London","Riverside Ground, Chester-le-Street"]
In [62]:
WC_Ground_Stats = []
ODI_Grounds = odi_Scores_Data.Ground
for i in ODI_Grounds:
    for j in WC_venue_pitches:
        if i in j:
            #print("i ; ",i,"--j : ",j)
            WC_Ground_Stats.append((i,j))
In [63]:
stadiums_data = [item[0] for item in set(WC_Ground_Stats)]
In [64]:
World_Cup_Grounds=["Lord's, London - England", "The Rose Bowl, Southampton - England", "Trent Bridge, Nottingham - England", "Sophia Gardens, Cardiff - England", "Kennington Oval, London - England", "Edgbaston, Birmingham - England", "Old Trafford, Manchester - England", "Riverside Ground, Chester-le-Street - England", "Headingley, Leeds - England", "County Ground, Bristol - England", "County Ground, Taunton - England"]
World_Cup_Grounds
Worldcup_Ground_Stats = []
England_Grounds = ground_Data.Ground
for grounds in England_Grounds:
    for venues in World_Cup_Grounds :
        if grounds in venues:
            Worldcup_Ground_Stats.append((venues))
Worldcup_Ground_Stats
WorldCup_Grounds_Stats = ground_Data[ground_Data.Ground.isin([Ground for Ground in Worldcup_Ground_Stats])]
WorldCup_Grounds_Stats.head()
Out[64]:
Ground Span Mat Won Tied NR Runs Wkts Balls Ave RPO
34 Lord's, London - England 2013-2018 7 7 0 0 3549 110 3851 32.26 5.52
35 The Rose Bowl, Southampton - England 2013-2019 8 8 0 0 4766 103 4522 46.27 6.32
36 Trent Bridge, Nottingham - England 2013-2019 9 7 1 1 4944 112 4530 44.14 6.54
37 Sophia Gardens, Cardiff - England 2013-2018 14 13 1 0 6690 221 7264 30.27 5.52
38 Kennington Oval, London - England 2013-2019 17 15 0 2 8038 208 8369 38.64 5.76

--> The table above shows,

  • aggregate stats for all the matches held at each venue in England.
  • the total number of matches played, the number of matches tied or with no result, the total runs scored at each venue, and so on.

This data can be used to understand whether a ground favours bowlers or batsmen. A low-scoring ground (low runs per over, RPO) is taken to favour bowlers, as not many runs are scored, and a high-scoring ground is taken to favour batsmen.
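
One simple way to apply this idea is to rank the grounds by RPO and label them against a cut-off. The sketch below uses the five rows shown in the table above; the cut-off of 6 runs per over is a hypothetical value chosen for illustration, not one used elsewhere in this project.

```python
import pandas as pd

# Ground names and RPO values copied from the table above
grounds = pd.DataFrame({
    "Ground": ["Lord's, London - England",
               "The Rose Bowl, Southampton - England",
               "Trent Bridge, Nottingham - England",
               "Sophia Gardens, Cardiff - England",
               "Kennington Oval, London - England"],
    "RPO": [5.52, 6.32, 6.54, 5.52, 5.76],
})

# Hypothetical cut-off: 6 or more runs per over counts as batting-friendly
RPO_CUTOFF = 6.0

ranked = grounds.sort_values("RPO", ascending=False).reset_index(drop=True)
ranked["Favours"] = ranked["RPO"].apply(
    lambda rpo: "batsmen" if rpo >= RPO_CUTOFF else "bowlers")
print(ranked)
```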

Batsmen Performance In England - Analysis

In [65]:
Batsman.drop(columns=Batsman.columns[0],inplace=True)
Batsman = Batsman[~Batsman.Bat1.isin(["DNB","TDNB"])]
Batsman = Batsman[Batsman.Player_ID.isin(wc_Players_Data.ID)]
stadiums = [item[0] for item in set(WC_Ground_Stats)]
Batsman_in_England = Batsman[Batsman.Ground.isin(stadiums)]
Batsman_in_England.head()
Out[65]:
Bat1 Runs BF SR 4s 6s Opposition Ground Start Date Match_ID Batsman Player_ID
111 7 7 13 53.84 1 0 v England Southampton 16 Jun 2012 ODI # 3276 Andre Russell 276298
193 5* 5 6 83.33 1 0 v Pakistan The Oval 7 Jun 2013 ODI # 3364 Kemar Roach 230553
194 0* 0 8 0.00 0 0 v India The Oval 11 Jun 2013 ODI # 3368 Kemar Roach 230553
246 1 1 8 12.50 0 0 v England Manchester 19 Sep 2017 ODI # 3911 Ashley Nurse 315594
248 1 1 5 20.00 0 0 v England Bristol 24 Sep 2017 ODI # 3915 Ashley Nurse 315594

Computing averages of Batsmen on grounds in England

In [66]:
def Out_NotOut(value):
    # A trailing "*" in Bat1 marks a not-out innings: return 0 for not out, 1 for out
    if "*" in value:
        return 0
    else:
        return 1
# Work on an explicit copy so the assignment does not raise SettingWithCopyWarning
Batsman_in_England = Batsman_in_England.copy()
Batsman_in_England["Out_NotOut"] = Batsman_in_England["Bat1"].apply(Out_NotOut)
#Batsman_in_England
In [67]:
# Convert the counting columns to integers in one call.
# astype returns a new DataFrame, so no SettingWithCopyWarning is raised
Batsman_in_England = Batsman_in_England.astype({"Runs": "int", "BF": "int", "4s": "int", "6s": "int"})
In [68]:
Batsman_Data_Dummy = Batsman_in_England
Batsman_in_England = Batsman_in_England.groupby(["Ground","Batsman"]).sum().reset_index()
In [69]:
Batsman_in_England["Average"] = Batsman_in_England["Runs"]/Batsman_in_England.Out_NotOut
Batsman_in_England.head()
Out[69]:
Ground Batsman Runs BF 4s 6s Player_ID Out_NotOut Average
0 Birmingham Aaron Finch 76 82 8 0 10668 2 38.0
1 Birmingham Adam Zampa 0 3 0 0 379504 1 0.0
2 Birmingham Adil Rashid 69 50 7 2 244497 1 69.0
3 Birmingham Alex Hales 159 141 14 6 999464 3 53.0
4 Birmingham Angelo Mathews 86 88 9 0 99528 1 86.0

--> The above dataframe describes the batsmen data.

Here we look at the batting averages; strike rates are computed further below. Other columns hold the number of 4s and 6s each batsman hit, and Out_NotOut holds the number of innings on these grounds in which he was dismissed (not-out innings are not counted). The Player_ID column at this stage is the sum of the IDs over the grouped rows and is not meaningful; the correct IDs are merged back in later.

  • 4s: This shows the number of 4s scored. 4 runs are scored if the ball bounces before touching or going over the edge of the field.
  • 6s: This shows the number of 6s scored. 6 runs are scored if the ball passes over the boundary in the air without bouncing.
  • Average: A player's batting average is the total number of runs they have scored divided by the number of times they have been out.
  • Strike Rate: The number of runs a batsman scores per 100 balls faced, a measure of how quickly he scores.
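
The two formulas can be illustrated on a few made-up innings. In the sketch below the players, IDs and scores are hypothetical. It also uses named aggregation so that the player ID is kept rather than summed, which is one possible way to avoid the summed Player_ID column seen above.

```python
import pandas as pd

# Hypothetical innings; a trailing "*" in Bat1 marks a not-out innings
toy_innings = pd.DataFrame({
    "Batsman":   ["Player A", "Player A", "Player A", "Player B"],
    "Player_ID": [101, 101, 101, 202],
    "Bat1":      ["40", "36", "10*", "69"],
    "Runs":      [40, 36, 10, 69],
    "BF":        [45, 37, 8, 50],
})

# 1 if the batsman was dismissed, 0 if he remained not out
toy_innings["Dismissed"] = (
    ~toy_innings["Bat1"].str.contains("*", regex=False)).astype(int)

per_player = (
    toy_innings
    .groupby("Batsman")
    .agg(Player_ID=("Player_ID", "first"),   # keep the ID instead of summing it
         Runs=("Runs", "sum"),
         BF=("BF", "sum"),
         Dismissals=("Dismissed", "sum"))
    .reset_index()
)

# Average = runs per dismissal; strike rate = runs per 100 balls faced
per_player["Average"] = per_player["Runs"] / per_player["Dismissals"]
per_player["Strike_Rate"] = per_player["Runs"] / per_player["BF"] * 100
print(per_player)
```

Player A scores 86 runs for 2 dismissals, so the not-out innings adds runs without adding a dismissal.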
In [70]:
Batsman_Data = Batsman_in_England.groupby(["Batsman"]).sum().reset_index()
Batsman_Data["Average"] = Batsman_Data["Runs"]/Batsman_Data["Out_NotOut"]
Batsman_Data.sort_values(by = "Average",ascending=False).sample(5)
Out[70]:
Batsman Runs BF 4s 6s Player_ID Out_NotOut Average
21 Dhananjaya de Silva 1 6 0 0 465793 1 1.000000
62 Mahmudullah 207 271 14 3 392175 4 51.750000
4 Alex Hales 1304 1256 162 29 9744774 36 36.222222
18 David Miller 289 294 15 12 2895993 6 48.166667
38 James Neesham 47 48 4 1 1065807 3 15.666667
In [71]:
Batsman_Average_Best = Batsman_Data[(Batsman_Data.Out_NotOut>0) & (Batsman_Data.Average > 40 )].sort_values(by = "Average",ascending = False)
Batsman_Average_Best.head()
Out[71]:
Batsman Runs BF 4s 6s Player_ID Out_NotOut Average
35 Imam-ul-Haq 234 263 24 1 2273104 2 117.000000
25 Evin Lewis 200 152 18 9 1295703 2 100.000000
39 Jason Holder 152 121 10 7 1174455 2 76.000000
81 Ravindra Jadeja 281 258 29 5 2346750 4 70.250000
86 Sarfaraz Ahmed 465 502 38 1 2277600 7 66.428571
In [72]:
Batsman_NoDuplicate = Batsman[["Player_ID","Batsman"]].drop_duplicates()
In [73]:
PlayerID = list(Batsman_Average_Best.merge(Batsman_NoDuplicate,how = "left",on = "Batsman")["Player_ID_y"].astype("int"))
Batsman_Average_Best["Player_ID"] = PlayerID
wc_Players_Data.columns = ["Player", "Player_ID","Country"]
Player_Country = list(Batsman_Average_Best.merge(wc_Players_Data,how = "left",on = "Player_ID")["Country"])
Batsman_Average_Best["Country"] = Player_Country
Batsman_Average_Best.head()
Out[73]:
Batsman Runs BF 4s 6s Player_ID Out_NotOut Average Country
35 Imam-ul-Haq 234 263 24 1 568276 2 117.000000 Pakistan
25 Evin Lewis 200 152 18 9 431901 2 100.000000 WestIndies
39 Jason Holder 152 121 10 7 391485 2 76.000000 WestIndies
81 Ravindra Jadeja 281 258 29 5 234675 4 70.250000 India
86 Sarfaraz Ahmed 465 502 38 1 227760 7 66.428571 Pakistan

Computing the Strike Rates for Batsmen

In [74]:
# Calculation for computing the Strike Rate
Batsman_Average_Best["Strike_Rate"] = Batsman_Average_Best["Runs"]/Batsman_Average_Best["BF"]*100
Batsman_Average_Best.head(10)
Out[74]:
Batsman Runs BF 4s 6s Player_ID Out_NotOut Average Country Strike_Rate
35 Imam-ul-Haq 234 263 24 1 568276 2 117.000000 Pakistan 88.973384
25 Evin Lewis 200 152 18 9 431901 2 100.000000 WestIndies 131.578947
39 Jason Holder 152 121 10 7 391485 2 76.000000 WestIndies 125.619835
81 Ravindra Jadeja 281 258 29 5 234675 4 70.250000 India 108.914729
86 Sarfaraz Ahmed 465 502 38 1 227760 7 66.428571 Pakistan 92.629482
46 Jonny Bairstow 1439 1295 166 23 297433 22 65.409091 England 111.119691
92 Shikhar Dhawan 976 966 118 13 28235 15 65.066667 India 101.035197
52 Kane Williamson 815 849 84 7 277906 13 62.692308 NewZealand 95.995289
82 Rohit Sharma 687 829 72 13 34102 12 57.250000 India 82.870929
33 Hashim Amla 851 941 94 5 43906 15 56.733333 SouthAfrica 90.435707
In [75]:
Batsman_Average_Best.sort_values(["Strike_Rate"],ascending = False).head(10)
Out[75]:
Batsman Runs BF 4s 6s Player_ID Out_NotOut Average Country Strike_Rate
25 Evin Lewis 200 152 18 9 431901 2 100.000000 WestIndies 131.578947
39 Jason Holder 152 121 10 7 391485 2 76.000000 WestIndies 125.619835
47 Jos Buttler 1654 1358 147 47 308967 32 51.687500 England 121.796760
28 Fakhar Zaman 452 394 50 11 512191 8 56.500000 Pakistan 114.720812
34 Imad Wasim 224 200 22 5 227758 5 44.800000 Pakistan 112.000000
46 Jonny Bairstow 1439 1295 166 23 297433 22 65.409091 England 111.119691
40 Jason Roy 1686 1539 187 34 298438 38 44.368421 England 109.551657
81 Ravindra Jadeja 281 258 29 5 234675 4 70.250000 India 108.914729
92 Shikhar Dhawan 976 966 118 13 28235 15 65.066667 India 101.035197
91 Shaun Marsh 372 374 28 11 6683 8 46.500000 Australia 99.465241

--> Findings from data above:

  • On the basis of averages and strike rates, batsmen from England are the best represented in the strike-rate table (three of the top ten). Among them, Jonny Bairstow and Jos Buttler have respectable averages as well.
  • Several Indian batsmen appear among the ten highest averages. However, they have played fewer innings on these grounds than Jonny Bairstow, so their averages rest on smaller samples.

Hence, looking at the data above, we think that England will have an advantage in the batting department. Below, we create a visualization representing this finding.
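
One way to reduce the small-sample effect is to require a minimum number of dismissals before comparing averages. The sketch below re-enters seven rows from the table above; `MIN_DISMISSALS` is a hypothetical threshold picked for illustration.

```python
import pandas as pd

# Runs and dismissals copied from the table above
best = pd.DataFrame({
    "Batsman": ["Imam-ul-Haq", "Evin Lewis", "Jason Holder", "Ravindra Jadeja",
                "Sarfaraz Ahmed", "Jonny Bairstow", "Shikhar Dhawan"],
    "Runs": [234, 200, 152, 281, 465, 1439, 976],
    "Out_NotOut": [2, 2, 2, 4, 7, 22, 15],
})
best["Average"] = best["Runs"] / best["Out_NotOut"]

# Hypothetical threshold: at least 10 dismissals before an average is compared
MIN_DISMISSALS = 10
established = (best[best["Out_NotOut"] >= MIN_DISMISSALS]
               .sort_values("Average", ascending=False))
print(established[["Batsman", "Out_NotOut", "Average"]])
```

With this threshold the batsmen whose averages come from only two dismissals drop out of the comparison.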

Let's Visualize

In [76]:
import matplotlib.pyplot as plt

# Count batsmen per country; the rename keeps the column called "Country" on any pandas version
Pie_Batsmen = Batsman_Average_Best["Country"].value_counts().rename("Country").to_frame()
Pie_Batsmen.index.name="Name"
plt.pie(Pie_Batsmen["Country"],labels=Pie_Batsmen.index,autopct='%1.1f%%')
plt.axis('equal')
plt.title('Percentage of Top Ranked Batsmen in a Team')
plt.show()

Fig8.

--> Here we see that,

  • the largest share of batsmen with high averages belongs to England. This is expected, as England are the host nation and have played most of their matches at home. However, Pakistan and India are close behind and are tied for 2nd position.
  • India and Pakistan are among the top-ranked teams in the world and have batsmen who perform well at many different grounds around the world. Even so, from this data we conclude that England have the strongest Batsmen among the teams participating in the World Cup, helped by their Home Advantage.
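
As a rough illustration of how the pie chart percentages arise, the sketch below computes the share per country for the ten batsmen listed in the strike-rate table above (Out[75]). This is only those ten rows, not the full Batsman_Average_Best data behind Fig8, so the numbers will differ from the chart.

```python
import pandas as pd

# Countries of the ten batsmen in the strike-rate table above (Out[75]).
countries = pd.Series([
    "WestIndies", "WestIndies", "England", "Pakistan", "Pakistan",
    "England", "England", "India", "India", "Australia",
])

# value_counts(normalize=True) returns fractions; multiply by 100 for percent.
share = countries.value_counts(normalize=True) * 100
print(share.round(1))
```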

Bowler's Data Analysis

In [77]:
# Filtering the bowlers data for the matches that were played in England
bowler = bowler[bowler.Ground.isin(stadiums)]
In [78]:
# Removing the rows from data where the overs is blank (-)
bowler = bowler[~bowler.Overs.str.contains('-')]
In [79]:
bowler.head()
Out[79]:
Unnamed: 0 Overs Mdns Runs Wkts Econ Ave SR Opposition Ground Start Date Match_ID Bowler Player_ID
6 7 7.0 0 52 2 7.42 26.00 21.0 v England The Oval 28 Jun 2011 ODI # 3165 Suranga Lakmal 49619
7 8 7.5 0 43 3 5.48 14.33 15.6 v England Leeds 1 Jul 2011 ODI # 3167 Suranga Lakmal 49619
8 9 10.0 0 62 2 6.20 31.00 30.0 v England Lord's 3 Jul 2011 ODI # 3168 Suranga Lakmal 49619
9 10 2.0 0 12 0 6.00 - - v England Nottingham 6 Jul 2011 ODI # 3169 Suranga Lakmal 49619
26 27 6.0 0 34 0 5.66 - - v England The Oval 22 May 2014 ODI # 3492 Suranga Lakmal 49619

--> The column names in the above DataFrame mean the following:

  • Econ: Economy - The average number of runs that the bowler concedes per over (6 balls)
  • Ave: Average - The average number of runs that the bowler concedes per wicket taken
  • Mdns: Maidens - The number of overs in which the bowler conceded 0 runs
  • SR: Strike Rate - The average number of balls bowled by the bowler per wicket taken
In [80]:
# Total number of balls bowled by the bowler: "7.5" means 7 overs and 5 balls, not 7.5 overs
def total_balls_bowled(value):
    if "." in value:
        over = value.split(".")
        return int(over[0]) * 6 + int(over[1])
    else:
        return int(value) * 6
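
The overs notation is easy to misread, so here is a small worked example of the same conversion. The helper name `overs_to_balls` is hypothetical and only restates the logic of the function above in a self-contained form.

```python
def overs_to_balls(overs: str) -> int:
    """Convert an overs string such as '7.5' (7 overs, 5 balls) to balls."""
    whole, _, extra = overs.partition(".")
    return int(whole) * 6 + (int(extra) if extra else 0)

# Values taken from the Overs column shown above, plus one without a decimal part.
for overs in ["7.0", "7.5", "10.0", "2"]:
    print(overs, "->", overs_to_balls(overs))
```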
In [81]:
Ground_names = dict(set(WC_Ground_Stats))
def Full_Ground_names(value):
    return Ground_names[value]
Ground_names
Out[81]:
{'Cardiff': 'Sophia Gardens, Cardiff',
 'Bristol': 'County Ground, Bristol',
 'Southampton': 'Rose Bowl, Southampton',
 "Lord's": "Lord's, London",
 'Leeds': 'Headingley, Leeds',
 'Manchester': 'Old Trafford, Manchester',
 'Nottingham': 'Trent Bridge, Nottingham',
 'The Oval': 'The Oval, London',
 'Birmingham': 'Edgbaston, Birmingham',
 'Chester-le-Street': 'Riverside Ground, Chester-le-Street'}
In [82]:
# Sum each bowler's stats over all the matches played on the grounds in England
bowler["Balls"] = bowler.Overs.apply(total_balls_bowled)
for i in ["Runs","Mdns","Wkts","Balls"]:
    bowler[i] = bowler[i].astype("float")
# Select the numeric columns before summing so that text columns are not aggregated
Bowlers_in_England = bowler.groupby(["Bowler"])[["Runs","Mdns","Wkts","Balls"]].sum().reset_index()
In [83]:
Bowlers_in_England.head()
Out[83]:
Bowler Runs Mdns Wkts Balls
0 Aaron Finch 7.0 0.0 0.0 6.0
1 Adam Zampa 65.0 1.0 2.0 74.0
2 Adil Rashid 2219.0 3.0 72.0 2382.0
3 Andile Phehlukwayo 158.0 1.0 3.0 144.0
4 Andre Russell 43.0 0.0 0.0 36.0
In [84]:
# From the sum above, we are now calculating the stats of a bowler for only the grounds in England
Bowlers_in_England["Economy"] = Bowlers_in_England.Runs * 6 /Bowlers_in_England.Balls
Bowlers_in_England["Average"] = Bowlers_in_England.Runs/ Bowlers_in_England.Wkts
Bowlers_in_England["Strike_Rate"] = Bowlers_in_England.Balls / Bowlers_in_England.Wkts
In [85]:
Bowlers_in_England.head()
Out[85]:
Bowler Runs Mdns Wkts Balls Economy Average Strike_Rate
0 Aaron Finch 7.0 0.0 0.0 6.0 7.000000 inf inf
1 Adam Zampa 65.0 1.0 2.0 74.0 5.270270 32.500000 37.000000
2 Adil Rashid 2219.0 3.0 72.0 2382.0 5.589421 30.819444 33.083333
3 Andile Phehlukwayo 158.0 1.0 3.0 144.0 6.583333 52.666667 48.000000
4 Andre Russell 43.0 0.0 0.0 36.0 7.166667 inf inf
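
The inf values above come from dividing by zero wickets. The sketch below reproduces the three formulas on two rows from the table and shows one possible way to mark the undefined ratios as missing instead. Replacing inf with NaN is an alternative suggested here, not the approach taken in this project, which filters those bowlers out.

```python
import numpy as np
import pandas as pd

# Two rows copied from the table above: one bowler without a wicket, one with many.
totals = pd.DataFrame({
    "Bowler": ["Aaron Finch", "Adil Rashid"],
    "Runs": [7.0, 2219.0],
    "Wkts": [0.0, 72.0],
    "Balls": [6.0, 2382.0],
})

totals["Economy"] = totals["Runs"] * 6 / totals["Balls"]   # runs per over
totals["Average"] = totals["Runs"] / totals["Wkts"]        # runs per wicket
totals["Strike_Rate"] = totals["Balls"] / totals["Wkts"]   # balls per wicket

# Optional: treat ratios with zero wickets as missing rather than infinite.
cleaned = totals.replace([np.inf, -np.inf], np.nan)
print(cleaned)
```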

Since we want to look at the data of only the best bowlers in a team, we remove all the bowlers who have bowled 10 overs (60 balls) or fewer in England. We also drop the bowlers who have taken 0 wickets, since their Average and Strike Rate are undefined (shown as inf above).

In [86]:
Bowlers_in_England = Bowlers_in_England[(Bowlers_in_England.Balls > 60) & (Bowlers_in_England.Wkts > 0)]
Bowlers_in_England.head()
Out[86]:
Bowler Runs Mdns Wkts Balls Economy Average Strike_Rate
1 Adam Zampa 65.0 1.0 2.0 74.0 5.270270 32.500000 37.000000
2 Adil Rashid 2219.0 3.0 72.0 2382.0 5.589421 30.819444 33.083333
3 Andile Phehlukwayo 158.0 1.0 3.0 144.0 6.583333 52.666667 48.000000
5 Angelo Mathews 232.0 1.0 8.0 306.0 4.549020 29.000000 38.250000
6 Ashley Nurse 202.0 0.0 1.0 167.0 7.257485 202.000000 167.000000
In [87]:
# Attach each bowler's Player_ID, then look up the country from the World Cup squad data
unique_bowler = bowler[['Player_ID','Bowler']].drop_duplicates()
Bowlers_in_England = Bowlers_in_England.merge(unique_bowler,how = "left",on = "Bowler")
wc_Players_Data.columns = ["Player", "Player_ID","Country"]
Country_Player = list(Bowlers_in_England.merge(wc_Players_Data,how = "left",on = "Player_ID")["Country"])
Bowlers_in_England["Country"] = Country_Player
# Manual correction of the country for the bowler at row position 55
Bowlers_in_England.iloc[55,-1] = "SriLanka"
In [88]:
Bowlers_in_England.head()
Out[88]:
Bowler Runs Mdns Wkts Balls Economy Average Strike_Rate Player_ID Country
0 Adam Zampa 65.0 1.0 2.0 74.0 5.270270 32.500000 37.000000 379504 Australia
1 Adil Rashid 2219.0 3.0 72.0 2382.0 5.589421 30.819444 33.083333 244497 England
2 Andile Phehlukwayo 158.0 1.0 3.0 144.0 6.583333 52.666667 48.000000 540316 SouthAfrica
3 Angelo Mathews 232.0 1.0 8.0 306.0 4.549020 29.000000 38.250000 49764 SriLanka
4 Ashley Nurse 202.0 0.0 1.0 167.0 7.257485 202.000000 167.000000 315594 WestIndies

--> A bowler who bowls many maiden overs keeps the opposition from scoring and has the potential to make his team win matches. So we list the top 10 bowlers by number of maiden overs.

In [89]:
Bowlers_in_England.sort_values(by = ["Mdns"], ascending=False)[:10]
Out[89]:
Bowler Runs Mdns Wkts Balls Economy Average Strike_Rate Player_ID Country
31 Lasith Malinga 1039.0 12.0 36.0 1126.0 5.536412 28.861111 31.277778 49758 SriLanka
9 Chris Woakes 1185.0 12.0 33.0 1257.0 5.656325 35.909091 38.090909 247235 England
6 Bhuvneshwar Kumar 517.0 11.0 18.0 693.0 4.476190 28.722222 38.500000 326016 India
34 Mark Wood 1234.0 10.0 28.0 1351.0 5.480385 44.071429 48.250000 351588 England
57 Tim Southee 704.0 10.0 25.0 761.0 5.550591 28.160000 30.440000 232364 NewZealand
11 David Willey 1221.0 8.0 39.0 1221.0 6.000000 31.307692 31.307692 308251 England
35 Mashrafe Mortaza 480.0 6.0 8.0 606.0 4.752475 60.000000 75.750000 56007 Bangladesh
25 Kagiso Rabada 268.0 5.0 8.0 312.0 5.153846 33.500000 39.000000 550215 SouthAfrica
39 Moeen Ali 1451.0 5.0 32.0 1607.0 5.417548 45.343750 50.218750 8917 England
48 Ravindra Jadeja 729.0 5.0 27.0 852.0 5.133803 27.000000 31.555556 234675 India

--> In the above DataFrame,

  • we can see the total number of maiden overs bowled by each bowler, but this is not a true measure of a bowler's performance. One bowler could have bowled 10 maidens out of 100 overs in total, whereas another might have bowled 10 maidens out of only 50 overs.
  • Thus we need to look at the percentage of overs that were maidens to truly understand the performance of the bowler.
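
The comparison above can be worked through with a minimal sketch. The helper name `maiden_percentage` is hypothetical, and the two example bowlers are the ones described in the bullets, not real players.

```python
def maiden_percentage(maidens: float, balls: float) -> float:
    """Share of overs that were maidens, in percent (6 balls per over)."""
    return maidens * 6 / balls * 100

# The two example bowlers from the text: 10 maidens each.
busy_bowler = maiden_percentage(10, 100 * 6)   # 10 maidens in 100 overs
tight_bowler = maiden_percentage(10, 50 * 6)   # 10 maidens in 50 overs
print(round(busy_bowler, 1), round(tight_bowler, 1))
```

The second bowler has the same number of maidens but twice the percentage, which is why the percentage is the fairer measure.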
In [90]:
Bowlers_in_England["Percentage_Of_Maiden_Overs"] = ((Bowlers_in_England.Mdns*6)/(Bowlers_in_England.Balls))*100
Bowlers_in_England.sort_values(by = ['Percentage_Of_Maiden_Overs'], ascending = False).head(10)
Out[90]:
Bowler Runs Mdns Wkts Balls Economy Average Strike_Rate Player_ID Country Percentage_Of_Maiden_Overs
29 Kemar Roach 75.0 4.0 3.0 96.0 4.687500 25.000000 32.000000 230553 WestIndies 25.000000
25 Kagiso Rabada 268.0 5.0 8.0 312.0 5.153846 33.500000 39.000000 550215 SouthAfrica 9.615385
6 Bhuvneshwar Kumar 517.0 11.0 18.0 693.0 4.476190 28.722222 38.500000 326016 India 9.523810
0 Adam Zampa 65.0 1.0 2.0 74.0 5.270270 32.500000 37.000000 379504 Australia 8.108108
57 Tim Southee 704.0 10.0 25.0 761.0 5.550591 28.160000 30.440000 232364 NewZealand 7.884363
44 Nathan Coulter-Nile 123.0 2.0 4.0 156.0 4.730769 30.750000 39.000000 261354 Australia 7.692308
31 Lasith Malinga 1039.0 12.0 36.0 1126.0 5.536412 28.861111 31.277778 49758 SriLanka 6.394316
35 Mashrafe Mortaza 480.0 6.0 8.0 606.0 4.752475 60.000000 75.750000 56007 Bangladesh 5.940594
9 Chris Woakes 1185.0 12.0 33.0 1257.0 5.656325 35.909091 38.090909 247235 England 5.727924
24 Junaid Khan 399.0 4.0 11.0 429.0 5.580420 36.272727 39.000000 259551 Pakistan 5.594406

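--> The above table now gives us the Top 10 Bowlers based on the percentage of maiden overs bowled. Note that Kemar Roach's 25% comes from only 96 balls (16 overs), so it rests on a small sample.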

In [91]:
Bowlers_in_England.sort_values(by = ["Average"], ascending  = True).head(10)
Out[91]:
Bowler Runs Mdns Wkts Balls Economy Average Strike_Rate Player_ID Country Percentage_Of_Maiden_Overs
30 Kuldeep Yadav 148.0 0.0 9.0 180.0 4.933333 16.444444 20.000000 559235 India 0.000000
41 Mohammed Shami 152.0 1.0 8.0 195.0 4.676923 19.000000 24.375000 481896 India 3.076923
28 Kedar Jadhav 67.0 0.0 3.0 72.0 5.583333 22.333333 24.000000 290716 India 0.000000
42 Mosaddek Hossain 73.0 0.0 3.0 74.0 5.918919 24.333333 24.666667 550133 Bangladesh 0.000000
29 Kemar Roach 75.0 4.0 3.0 96.0 4.687500 25.000000 32.000000 230553 WestIndies 25.000000
26 Kane Richardson 156.0 1.0 6.0 156.0 6.000000 26.000000 26.000000 272262 Australia 3.846154
59 Trent Boult 240.0 1.0 9.0 264.0 5.454545 26.666667 29.333333 277912 NewZealand 2.272727
48 Ravindra Jadeja 729.0 5.0 27.0 852.0 5.133803 27.000000 31.555556 234675 India 3.521127
15 Hasan Ali 623.0 3.0 23.0 666.0 5.612613 27.086957 28.956522 681305 Pakistan 2.702703
54 Steve Smith 191.0 0.0 7.0 228.0 5.026316 27.285714 32.571429 267192 Australia 0.000000

--> The above table gives us the Top 10 bowlers based on the 'Average' (lowest first, since a lower bowling Average is better)

In [92]:
Bowlers_in_England.sort_values(by = ["Economy"])[:10]
Out[92]:
Bowler Runs Mdns Wkts Balls Economy Average Strike_Rate Player_ID Country Percentage_Of_Maiden_Overs
45 Nathan Lyon 70.0 0.0 1.0 102.0 4.117647 70.000000 102.000000 272279 Australia 0.000000
53 Shoaib Malik 376.0 3.0 11.0 504.0 4.476190 34.181818 45.818182 42657 Pakistan 3.571429
6 Bhuvneshwar Kumar 517.0 11.0 18.0 693.0 4.476190 28.722222 38.500000 326016 India 9.523810
61 Yuzvendra Chahal 135.0 0.0 2.0 180.0 4.500000 67.500000 90.000000 430246 India 0.000000
3 Angelo Mathews 232.0 1.0 8.0 306.0 4.549020 29.000000 38.250000 49764 SriLanka 1.960784
40 Mohammad Hafeez 540.0 1.0 10.0 711.0 4.556962 54.000000 71.100000 41434 Pakistan 0.843882
41 Mohammed Shami 152.0 1.0 8.0 195.0 4.676923 19.000000 24.375000 481896 India 3.076923
29 Kemar Roach 75.0 4.0 3.0 96.0 4.687500 25.000000 32.000000 230553 WestIndies 25.000000
7 Chris Gayle 446.0 2.0 15.0 566.0 4.727915 29.733333 37.733333 51880 WestIndies 2.120141
44 Nathan Coulter-Nile 123.0 2.0 4.0 156.0 4.730769 30.750000 39.000000 261354 Australia 7.692308

--> The above table gives us the Top 10 Bowlers based on the 'Economy' (lowest first, since a lower Economy is better)

In [93]:
Bowlers_in_England.sort_values(by = ["Strike_Rate"])[:10]
Out[93]:
Bowler Runs Mdns Wkts Balls Economy Average Strike_Rate Player_ID Country Percentage_Of_Maiden_Overs
30 Kuldeep Yadav 148.0 0.0 9.0 180.0 4.933333 16.444444 20.000000 559235 India 0.000000
28 Kedar Jadhav 67.0 0.0 3.0 72.0 5.583333 22.333333 24.000000 290716 India 0.000000
41 Mohammed Shami 152.0 1.0 8.0 195.0 4.676923 19.000000 24.375000 481896 India 3.076923
42 Mosaddek Hossain 73.0 0.0 3.0 74.0 5.918919 24.333333 24.666667 550133 Bangladesh 0.000000
58 Tom Curran 211.0 1.0 7.0 180.0 7.033333 30.142857 25.714286 550235 England 3.333333
26 Kane Richardson 156.0 1.0 6.0 156.0 6.000000 26.000000 26.000000 272262 Australia 3.846154
15 Hasan Ali 623.0 3.0 23.0 666.0 5.612613 27.086957 28.956522 681305 Pakistan 2.702703
21 Jeevan Mendis 170.0 0.0 6.0 174.0 5.862069 28.333333 29.000000 49700 SriLanka 0.000000
59 Trent Boult 240.0 1.0 9.0 264.0 5.454545 26.666667 29.333333 277912 NewZealand 2.272727
57 Tim Southee 704.0 10.0 25.0 761.0 5.550591 28.160000 30.440000 232364 NewZealand 7.884363

--> The above table gives us the Top 10 wicket-taking bowlers based on Strike Rate.

  • The top 3 bowlers are from India.
In [94]:
aggregations = {
    'Runs':'sum',
    'Mdns':'sum',
    'Wkts':'sum',
    'Balls':'sum',
    'Economy': 'mean',
    'Average':'mean',
    'Strike_Rate':'mean',
    'Percentage_Of_Maiden_Overs':'mean'
}
Bowlers_in_England_TeamWise_Data = Bowlers_in_England.groupby('Country').agg(aggregations).reset_index()
Bowlers_in_England_TeamWise_Data
Out[94]:
Country Runs Mdns Wkts Balls Economy Average Strike_Rate Percentage_Of_Maiden_Overs
0 Australia 2175.0 12.0 60.0 2300.0 5.631351 42.014177 46.166234 3.249520
1 Bangladesh 1358.0 6.0 22.0 1484.0 5.778804 79.500000 81.583333 1.188119
2 England 11467.0 43.0 318.0 11942.0 5.920822 40.538542 41.417132 2.381099
3 India 2467.0 19.0 77.0 2901.0 5.255050 47.922222 52.581173 2.236536
4 NewZealand 1782.0 11.0 53.0 1769.0 6.263260 37.813667 35.754667 2.031418
5 Pakistan 2919.0 12.0 72.0 3306.0 5.781898 58.692688 57.727394 1.728458
6 SouthAfrica 1450.0 12.0 39.0 1621.0 5.460761 38.806944 42.541667 3.914835
7 SriLanka 2592.0 14.0 78.0 2721.0 5.701535 39.753114 42.422306 1.564634
8 WestIndies 885.0 6.0 21.0 969.0 5.903939 84.433333 76.683333 6.780035

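--> The above table represents the bowlers' stats team-wise. Runs, Mdns, Wkts and Balls are team totals, while Economy, Average, Strike_Rate and Percentage_Of_Maiden_Overs are means of the individual bowlers' values.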
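
Averaging each bowler's rate gives every bowler equal weight, however few overs he bowled. An alternative, sketched here as a suggestion rather than the method used in this project, is to compute pooled team rates directly from the team totals in the table above.

```python
import pandas as pd

# Team totals copied from the table above (three teams for brevity).
team_totals = pd.DataFrame({
    "Country": ["England", "India", "NewZealand"],
    "Runs": [11467.0, 2467.0, 1782.0],
    "Wkts": [318.0, 77.0, 53.0],
    "Balls": [11942.0, 2901.0, 1769.0],
}).set_index("Country")

# Pooled rates: every ball bowled counts equally, regardless of who bowled it.
pooled = pd.DataFrame({
    "Economy": team_totals["Runs"] * 6 / team_totals["Balls"],
    "Average": team_totals["Runs"] / team_totals["Wkts"],
    "Strike_Rate": team_totals["Balls"] / team_totals["Wkts"],
})
print(pooled.round(2))
```

For India, for example, the pooled Average comes out well below the mean of the per-bowler Averages in the table, because bowlers with few wickets pull the mean up.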

In [95]:
plt.figure(figsize=(15,10))
sns.boxplot(x = "Country", y = "Economy", data = Bowlers_in_England).set_title("Average Economy Rate - Team Wise")
Out[95]:
Text(0.5,1,'Average Economy Rate - Team Wise')

Fig9.

--> From the above Box Plot we observe that,

  • teams like India, Australia and South Africa have low Economy rates. These teams concede fewer runs per over, which makes it easier for them to defend their totals and to restrict opponents to low totals.
  • England has an average bowling performance.
  • West Indies are not at all consistent: the Box Plot shows a large spread in their Economy rates.
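
The spread that the Box Plot shows can also be put into numbers. The sketch below uses only the West Indies and India bowlers whose rows are visible in the tables above, so it is a partial illustration and not the full data behind Fig9.

```python
import pandas as pd

# Economy rates copied from the tables above (only the rows that are shown).
economy = pd.DataFrame({
    "Country": ["WestIndies"] * 3 + ["India"] * 6,
    "Bowler": [
        "Ashley Nurse", "Kemar Roach", "Chris Gayle",
        "Bhuvneshwar Kumar", "Kuldeep Yadav", "Mohammed Shami",
        "Kedar Jadhav", "Ravindra Jadeja", "Yuzvendra Chahal",
    ],
    "Economy": [
        7.257485, 4.687500, 4.727915,
        4.476190, 4.933333, 4.676923,
        5.583333, 5.133803, 4.500000,
    ],
})

# Median for the typical value, std/min/max for the spread within each team.
spread = economy.groupby("Country")["Economy"].agg(["median", "std", "min", "max"])
print(spread.round(2))
```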
In [96]:
plt.figure(figsize=(15,10))
sns.boxenplot(x = "Country", y = "Strike_Rate", data = Bowlers_in_England).set_title("Average Strike Rate - Team Wise")
Out[96]:
Text(0.5,1,'Average Strike Rate - Team Wise')

Fig10.

--> We observe that,

  • Most of the teams take around 40 to 60 balls to take a wicket.
  • New Zealand and South Africa are the most consistent teams when it comes to taking wickets.
  • Bangladesh are not very good at taking wickets. Their mean Strike Rate is about 82 balls per wicket (see the team-wise table above) and it varies from about 25 to 170 balls, so they are not very consistent when it comes to bowling.
In [97]:
plt.figure(figsize=(15,8))
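g = sns.lineplot( data = Bowlers_in_England_TeamWise_Data[["Economy","Percentage_Of_Maiden_Overs"]])
# Place one tick per team so that every label lines up with its data point
g.set_xticks(range(len(Bowlers_in_England_TeamWise_Data)))
g.set_xticklabels(Bowlers_in_England_TeamWise_Data.Country)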
plt.title('Comparison between Economy and Perc. Of Maiden Overs')
Out[97]:
Text(0.5,1,'Comparison between Economy and Perc. Of Maiden Overs')

Fig11.

--> The above line graph gives us the following details:

  • South Africa (3.9%) and Australia (3.2%) have a good percentage of Maiden Overs bowled, while India (2.2%) are mid-table. West Indies' 6.8% is the highest, but it includes Kemar Roach's 25% from only 16 overs.
  • India and South Africa have the lowest Economy rates. This means that these teams give away few Runs per Over and hence can restrict their opponents to low totals.
  • South Africa seems to be the best team in terms of Bowling performance.

Section4: Conclusion

In this analysis, we focused on three different parameters to predict the outcome of the tournament: the performance of the teams, the batsmen and the bowlers at the World Cup venues (England in this case). These analyses gave us some insight into which team would have an edge in the tournament.

  • Based on the venue, we could see that England had an edge over all the other teams as they had a huge Home Advantage. They were familiar with the pitches and the ground conditions, whereas all the 'Away' teams had to adapt to the new conditions.
  • We found the top 10 batsmen across all the countries and saw that the stats favoured the English Batsmen: they appeared most often among the leaders in Strike Rate and Average on the grounds in England. This meant that England had the best chance of posting big totals in a match.
  • Also, on comparing the bowlers' data on the basis of factors like Economy, Strike Rate and Average, we find that even though the English bowlers are not the best in the world, there are players like Chris Woakes who contributed hugely to England's victory. He was among the top 5 bowlers by number of maiden overs (12) and also maintained a reasonable Economy rate (5.66).

Based on the analysis above, we conclude that the home team 'England' is our pick of the lot.

Traditionally, the host nation tends to start a tournament as favorites. The last two World Cups were also won by a host nation. Playing the maximum number of matches at home gives the hosts an edge, as they are well versed in the conditions at play.

The same approach should work for any future World Cup as well. We would just need to update the venue details in our data set and add the corresponding data about the teams and the venue to our data files.
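
As a rough sketch of what updating the venue details could look like, the filter step can be wrapped in a small function that takes the venue list as a parameter. The function name `filter_by_venue` and the shortened venue list are hypothetical; the rows are copied from the bowler table shown earlier.

```python
import pandas as pd

def filter_by_venue(matches: pd.DataFrame, venues) -> pd.DataFrame:
    """Keep only the rows whose Ground is one of the tournament venues."""
    return matches[matches["Ground"].isin(venues)]

# Rows copied from the bowler data shown earlier (Suranga Lakmal in England).
matches = pd.DataFrame({
    "Bowler": ["Suranga Lakmal"] * 5,
    "Ground": ["The Oval", "Leeds", "Lord's", "Nottingham", "The Oval"],
    "Wkts": [2, 3, 2, 0, 0],
})

# Hypothetical venue list for another tournament: swap in the new grounds here.
london_only = filter_by_venue(matches, ["The Oval", "Lord's"])
print(london_only)
```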